Title: | Probability and Statistics with R |
---|---|
Description: | Functions and data sets for the text Probability and Statistics with R. |
Authors: | Alan T. Arnholt [aut, cre] |
Maintainer: | Alan T. Arnholt <[email protected]> |
License: | GPL-2 |
Version: | 1.3 |
Built: | 2025-02-02 02:39:10 UTC |
Source: | https://github.com/alanarnholt/paswr |
Data and functions for the book Probability and Statistics with R
Package: | PASWR |
Type: | Package |
Version: | 1.2 |
Date: | 2016-02-24 |
License: | GPL (>=2) |
Comprehensive and engineering-oriented, Probability and Statistics with R provides a thorough treatment of probability and statistics, clear and accessible real-world examples, and fully detailed proofs. The text provides step-by-step explanations for numerous examples in R and S-PLUS for nearly every topic covered, including both traditional and nonparametric techniques. With a wide range of graphs to illustrate complex material as well as a solutions manual, the book also offers an accompanying website that features supporting information, including datasets, functions, and other downloadable material. It is ideal for undergraduate students and for engineers and scientists who must perform statistical analyses.
Alan T. Arnholt
Maintainer: <[email protected]>
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
Data regarding aggressive behavior in relation to exposure to violent television programs used in Example 10.5
A data frame with 16 observations on the following 2 variables:
violence
(an integer vector)
noviolence
(an integer vector)
This is data regarding aggressive behavior in relation to exposure to violent television programs from Gibbons (1997) with the following exposition:
... a group of children are matched as well as possible as regards home environment, genetic factors, intelligence, parental attitudes, and so forth, in an effort to minimize factors other than TV that might influence a tendency for aggressive behavior. In each of the resulting 16 pairs, one child is randomly selected to view the most violent shows on TV, while the other watches cartoons, situation comedies, and the like. The children are then subjected to a series of tests designed to produce an ordinal measure of their aggression factors. (pages 143-144)
Gibbons, J. D. (1997) Nonparametric Methods for Quantitative Analysis. American Sciences Press.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Aggression, wilcox.test(violence, noviolence, paired = TRUE, alternative = "greater"))
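The statistic behind the paired test above can be sketched by hand. The block below uses made-up continuous scores (not the Aggression data, and without ties) and assumes only base R; the variable names are reused purely for illustration.

```r
# Illustrative data only: 16 made-up pairs, NOT the values in Aggression.
set.seed(5)
violence   <- rnorm(16, mean = 30, sd = 8)
noviolence <- rnorm(16, mean = 26, sd = 8)

# Signed-rank statistic: rank the absolute paired differences, then
# add up the ranks that belong to positive differences.
d <- violence - noviolence
V <- sum(rank(abs(d))[d > 0])

# The same statistic as reported by wilcox.test() for paired data.
res <- wilcox.test(violence, noviolence, paired = TRUE, alternative = "greater")
c(by.hand = V, wilcox.test = unname(res$statistic))
```

With `alternative = "greater"`, large values of the statistic favor the hypothesis that the first sample tends to exceed the second.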
An experiment was undertaken where seventeen recently picked (Fresh) apples were randomly selected and measured for hardness. Seventeen apples were also randomly selected from a warehouse (Warehouse) where the apples had been stored for one week. Data are used in Example 8.10.
A data frame with 17 observations on the following 2 variables:
Fresh
(hardness rating measured in )
Warehouse
(hardness rating measured in )
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
# Figure 8.5
attach(Apple)
par(pty = "s")
Altblue <- "#A9E2FF"
Adkblue <- "#0080FF"
fresh <- qqnorm(Fresh)
old <- qqnorm(Warehouse)
plot(fresh, type = "n", ylab = "Sample Quantiles", xlab = "Theoretical Quantiles")
qqline(Fresh, col = Altblue)
qqline(Warehouse, col = Adkblue)
points(fresh, col = Altblue, pch = 16, cex = 1.2)
points(old, col = Adkblue, pch = 17)
legend(-1.75, 9.45, c("Fresh", "Warehouse"), col = c(Altblue, Adkblue),
       text.col = c("black", "black"), pch = c(16, 17), lty = c(1, 1),
       bg = "gray95", cex = 0.75)
title("Q-Q Normal Plots")
detach(Apple)
# Trellis approach
qqmath(~c(Fresh, Warehouse), type = c("p", "r"), pch = c(16, 17), cex = 1.2,
       col = c("#A9E2FF", "#0080FF"),
       groups = rep(c("Fresh", "Warehouse"), c(length(Fresh), length(Warehouse))),
       data = Apple, ylab = "Sample Quantiles", xlab = "Theoretical Quantiles")
Size of apartments in Mendebaldea, Spain and San Jorge, Spain
A data frame with 8 observations on the following 2 variables:
Mendebaldea apartment size in square meters
San Jorge apartment size in square meters
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = AptSize, boxplot(Mendebaldea, SanJorge) )
Baseball statistics for George Herman Ruth (The Bambino or The Sultan Of Swat)
A data frame with 22 observations on the following 14 variables.
year in which the season occurred
team he played for: Bos-A, Bos-N, or NY-A
games played
at bats
runs scored
hits
doubles
triples
home runs
runs batted in
stolen bases
base on balls or walks
batting average H/AB
slugging percentage (total bases/at bats)
https://www.baseball-reference.com/about/bat_glossary.shtml
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Baberuth, hist(RBI))
Two volunteers each consumed a twelve ounce beer every fifteen minutes for one hour. One hour after the fourth beer was consumed, each volunteer's blood alcohol was measured with a different breathalyzer from the same company. The numbers recorded in data frame Bac are the sorted blood alcohol content values reported with breathalyzers from company X and company Y. Data are used in Example 9.15.
A data frame with 10 observations on the following 2 variables:
blood alcohol content measured in g/L
blood alcohol content measured in g/L
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Bac, var.test(X, Y, alternative = "less"))
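The one-sided variance comparison above rests on a ratio of sample variances. A minimal sketch, using made-up readings rather than the Bac data and assuming only base R, shows how the F statistic and the lower-tail p-value can be reproduced:

```r
# Illustrative readings only: NOT the values stored in Bac.
set.seed(3)
X <- rnorm(10, mean = 0.08, sd = 0.01)
Y <- rnorm(10, mean = 0.08, sd = 0.02)

# F statistic: ratio of the two sample variances.
f <- var(X) / var(Y)

# For alternative = "less", the p-value is the lower-tail F probability.
p <- pf(f, df1 = length(X) - 1, df2 = length(Y) - 1)

res <- var.test(X, Y, alternative = "less")
c(f.by.hand = f, p.by.hand = p, p.var.test = res$p.value)
```

A small ratio (and small p-value) suggests the first breathalyzer's readings vary less than the second's; the test assumes both samples come from normal populations.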
A manufacturer of lithium batteries has two production facilities, A and B. Fifty randomly selected batteries with an advertised life of 180 hours are selected and tested. The lifetimes are stored in facilityA. Fifty randomly selected batteries with an advertised life of 200 hours are selected and tested. The lifetimes are stored in facilityB.
A data frame with 50 observations on the following 2 variables:
life time measured in hours
life time measured in hours
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Battery, qqnorm(facilityA))
with(data = Battery, qqline(facilityA))
Function that generates and displays m repeated samples of n Bernoulli trials with a given probability of success.
bino.gen(samples, n, pi)
samples | number of repeated samples to generate |
n | number of Bernoulli trials |
pi | probability of success for Bernoulli trial |
simulated.distribution | Simulated binomial distribution |
theoretical.distribution | Theoretical binomial distribution |
Alan T. Arnholt
bino.gen(1000, 20, 0.75)
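The idea behind this kind of simulation can be sketched in base R. The block below is a hypothetical stand-in, not the package's implementation: it draws repeated samples of Bernoulli trials, tabulates the number of successes, and sets the simulated relative frequencies beside the theoretical binomial probabilities. The seed and the name `p` are assumptions made for illustration.

```r
# Hypothetical sketch of the simulation idea; not the code of bino.gen().
set.seed(13)        # assumed seed, only for reproducibility
samples <- 1000     # number of repeated samples
n <- 20             # Bernoulli trials per sample
p <- 0.75           # probability of success on each trial

# Each value is the number of successes observed in n Bernoulli trials.
successes <- rbinom(samples, size = n, prob = p)

# Simulated distribution: relative frequency of each count 0, 1, ..., n.
simulated <- as.vector(table(factor(successes, levels = 0:n))) / samples

# Theoretical distribution: binomial probabilities for 0, 1, ..., n.
theoretical <- dbinom(0:n, size = n, prob = p)

comparison <- rbind(simulated, theoretical)
colnames(comparison) <- 0:n
round(comparison, 3)
```

As `samples` grows, the simulated row should move closer to the theoretical row.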
Several measurements of 42 beech trees (Fagus sylvatica) taken from a forest in Navarra (Spain).
A data frame with 42 observations on the following 4 variables:
diameter of the stem in centimeters
height of the tree in meters
weight of the stem in kilograms
aboveground weight in kilograms
Gobierno de Navarra and Gestion Ambiental Viveros y Repoblaciones de Navarra, 2006. The data were obtained within the European Project FORSEE.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
plot(log(PSA) ~ log(Dn), data = biomass)
Values from a study reported in the American Journal of Clinical Nutrition that investigated a new method for measuring body composition
A data frame with 18 observations on the following 3 variables:
age in years
body fat composition
a factor with levels F for female and M for male
Mazess, R. B., Peppler, W. W., and Gibbons, M. (1984) Total Body Composition by Dual-Photon (153 Gd) Absorptiometry. American Journal of Clinical Nutrition, 40, 4: 834-839.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
boxplot(fat ~ sex, data = Bodyfat)
Mathematical assessment scores for 36 students enrolled in a biostatistics course according to whether or not the students had successfully completed a calculus course prior to enrolling in the biostatistics course
A data frame with 18 observations on the following 2 variables:
assessment score for students with no prior calculus
assessment score for students with prior calculus
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Calculus, z.test(x = Yes.Calculus, y = No.Calculus, sigma.x = 5, sigma.y = 12)$conf )
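The confidence interval requested above is a two-sample z interval with known population standard deviations. A minimal base R sketch of that formula follows; the scores are made up (not the Calculus data), and the 95% level is an assumption chosen for illustration.

```r
# Illustrative scores only: NOT the values stored in Calculus.
set.seed(7)
yes <- rnorm(18, mean = 85, sd = 5)     # hypothetical "prior calculus" scores
no  <- rnorm(18, mean = 70, sd = 12)    # hypothetical "no prior calculus" scores

sigma.x <- 5          # known standard deviation, first population
sigma.y <- 12         # known standard deviation, second population
conf.level <- 0.95    # assumed confidence level

# Critical value and standard error of the difference in sample means.
z  <- qnorm(1 - (1 - conf.level) / 2)
se <- sqrt(sigma.x^2 / length(yes) + sigma.y^2 / length(no))

# Interval for the difference in population means (yes minus no).
ci <- (mean(yes) - mean(no)) + c(-1, 1) * z * se
ci
```

Because the standard deviations are treated as known, the interval width depends only on the sample sizes and the confidence level, not on the observed data.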
The numbers of cars per 1000 inhabitants (cars), the total number of known mortal accidents (deaths), and the country population/1000 (population) for the 25 member countries of the European Union for the year 2004
A data frame with 25 observations on the following 4 variables:
a factor with levels Austria, Belgium, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Poland, Portugal, Slovakia, Slovenia, Spain, Sweden, and United Kingdom
numbers of cars per 1000 inhabitants
total number of known mortal accidents
country population/1000
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
plot(deaths ~ cars, data = Cars2004EU)
Function that creates four graphs that can be used to help assess independence, normality, and constant variance
checking.plots(model, n.id = 3, COL = c("#0080FF", "#A9E2FF"))
model | an aov or lm object |
n.id | the number of points to identify |
COL | vector of two colors |
Alan T. Arnholt <[email protected]>
mod.aov <- aov(StopDist ~ tire, data = Tire)
checking.plots(mod.aov)
rm(mod.aov)
Two techniques of splitting chips are randomly assigned to 28 sheets so that each technique is applied to 14 sheets. The values recorded in Chips are the number of usable chips from each silicon sheet.
A data frame with 14 observations on the following 2 variables:
number of usable chips
number of usable chips
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
par(mfrow = c(1, 2))
with(data = Chips, qqnorm(techniqueI))
with(data = Chips, qqline(techniqueI))
with(data = Chips, qqnorm(techniqueII))
with(data = Chips, qqline(techniqueII))
par(mfrow = c(1, 1))
# Trellis Approach
graph1 <- qqmath(~techniqueI, data = Chips, type = c("p", "r"))
graph2 <- qqmath(~techniqueII, data = Chips, type = c("p", "r"))
print(graph1, split = c(1, 1, 2, 1), more = TRUE)
print(graph2, split = c(2, 1, 2, 1), more = FALSE)
rm(graph1, graph2)
CircuitDesigns contains the results from an accelerated life test used to estimate the lifetime of four different circuit designs (lifetimes in thousands of hours).
A data frame with 26 observations on the following 2 variables:
lifetimes in thousands of hours
a factor with levels Design1, Design2, Design3, and Design4
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
bwplot(design ~ lifetime, data = CircuitDesigns)
This program simulates random samples from which it constructs confidence intervals for either the population mean, the population variance, or the population proportion of successes.
CIsim( samples = 100, n = 30, parameter = 0.5, sigma = 1, conf.level = 0.95, type = c("Mean", "Var", "Pi") )
samples | the number of samples desired. |
n | the size of each sample |
parameter | If constructing confidence intervals for the population mean or the population variance, parameter is the population mean (i.e., type is one of either |
sigma | is the population standard deviation. |
conf.level | confidence level for the graphed confidence intervals, restricted to lie between zero and one |
type | character string, one of |
Default is to construct confidence intervals for the population mean. Simulated confidence intervals for the population variance or population proportion of successes are possible by selecting the appropriate value in the type argument.
Performs specified simulation and draws the resulting confidence intervals on a graphical device.
Alan T. Arnholt <[email protected]>
CIsim(samples = 100, n = 30, parameter = 100, sigma = 10, conf.level = 0.90)
# Simulates 100 samples of size 30 from a normal distribution with mean 100
# and a standard deviation of 10. From the 100 simulated samples, 90% confidence
# intervals for the Mean are constructed and depicted in the graph.
CIsim(100, 30, 100, 10, type = "Var")
# Simulates 100 samples of size 30 from a normal distribution with mean 100
# and a standard deviation of 10. From the 100 simulated samples, 95% confidence
# intervals for the variance are constructed and depicted in the graph.
CIsim(100, 50, 0.5, type = "Pi", conf.level = 0.92)
# Simulates 100 samples of size 50 from a binomial distribution where the
# population proportion of successes is 0.5. From the 100 simulated samples,
# 92% confidence intervals for Pi are constructed and depicted in the graph.
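The coverage idea behind the first example can be checked numerically without a plot. The sketch below is not the code of CIsim(); it assumes a z-based interval for the mean with known sigma, and the seed and object names are chosen only for illustration.

```r
# Hypothetical sketch of the coverage idea; not the code of CIsim().
set.seed(1)          # assumed seed, only for reproducibility
samples <- 100
n <- 30
mu <- 100            # population mean being estimated
sigma <- 10          # population standard deviation, treated as known
conf.level <- 0.90

# One sample mean per simulated sample.
xbar <- replicate(samples, mean(rnorm(n, mean = mu, sd = sigma)))

# z-based confidence limits for each simulated sample.
z <- qnorm(1 - (1 - conf.level) / 2)
lower <- xbar - z * sigma / sqrt(n)
upper <- xbar + z * sigma / sqrt(n)

# Proportion of intervals that contain the true mean.
covered <- lower <= mu & mu <= upper
mean(covered)
```

Over many repetitions the proportion of covering intervals should be close to `conf.level`, although any single run of 100 samples will vary.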
Computes all possible combinations of n objects taken k at a time.
Combinations(n, k)
n | a number |
k | a number less than or equal to |
Returns a matrix containing the possible combinations of n objects taken k at a time.
Combinations(5, 2)
# The columns in the matrix list the values of the 10 possible
# combinations of 5 things taken 2 at a time.
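For comparison, base R's `combn()` produces a matrix with the same column-per-combination layout. This is a separate function from `Combinations()`; it is shown here only to illustrate the shape of such a result and how the column count relates to `choose()`.

```r
# Base R analogue: each column of the result is one combination.
combos <- combn(5, 2)
combos

# The number of columns equals "5 choose 2".
ncol(combos)
choose(5, 2)
```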
The Cosmed is a portable metabolic system. A study at Appalachian State University compared the metabolic values obtained from the Cosmed to those of a reference unit (Amatek) over a range of workloads from easy to maximal to test the validity and reliability of the Cosmed. A small portion of the results for VO2 (ml/kg/min) measurements taken at a 150 watt workload are stored in CosAma.
A data frame with 14 observations on the following 3 variables:
subject number
measured VO2 with Cosmed
measured VO2 with Amatek
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
bwplot(~(Cosmed - Amatek), data = CosAma)
Random samples of ten mature (five-year-old and older) and ten two-year-old cows were taken from each of five breeds. The average butterfat percentage of these 100 cows is stored in the variable butterfat with the type of cow stored in the variable breed and the age of the cow stored in the variable age.
A data frame with 100 observations on the following 3 variables:
average butterfat percentage
a factor with levels 2 years old and Mature
a factor with levels Ayrshire, Canadian, Guernsey, Holstein-Friesian, and Jersey
Canadian record book of purebred dairy cattle.
Sokal, R. R. and Rohlf, F. J. (1994) Biometry. W. H. Freeman, New York, third edition.
summary(aov(butterfat ~ breed + age, data = Cows))
Number of dependent children for 50 families.
A data frame with 50 observations on the following 4 variables.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Kitchens, L. J. (2003) Basic Statistics and Data Analysis. Duxbury
with(data = Depend, table(C1))
Drosophila contains per diem fecundity (number of eggs laid per female per day for the first 14 days of life) for 25 females from each of three lines of Drosophila melanogaster. The three lines are Nonselected (control), Resistant, and Susceptible. Data are used in Example 11.5.
A data frame with 75 observations on the following 2 variables:
number of eggs laid per female per day for the first 14 days of life
a factor with levels Nonselected, Resistant, and Susceptible
The original measurements are from an experiment conducted by R. R. Sokal (Sokal and Rohlf, 1994, p. 237).
Sokal, R. R. and Rohlf, F. J. (1994) Biometry. W. H. Freeman, New York, third edition.
summary(aov(Fecundity ~ Line, data = Drosophila))
Function that produces a histogram, density plot, boxplot, and Q-Q plot
EDA(x, trim = 0.05, dec = 3)
x | is a numeric vector where |
trim | is a fraction (between 0 and 0.5, inclusive) of values to be trimmed from each end of the ordered data such that if |
dec | is a number specifying the number of decimals |
The function EDA() will not return console window information on data sets containing more than 5000 observations. It will, however, still produce graphical output for data sets containing more than 5000 observations.
Function returns various measures of center and location. The values returned for the quartiles are based on the default R definitions for quartiles. For more information on the definition of the quartiles, type ?quantile and read about the algorithm used by type = 7.
Alan T. Arnholt <[email protected]>
EDA(x = rnorm(100))
# Produces four graphs for the 100 randomly
# generated standard normal variates.
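Two of the quantities mentioned above, a trimmed mean and type 7 quartiles, can be illustrated with base R alone. This is a sketch of those standard calculations, not of EDA() itself; the seed and the hand-computed version are assumptions added for illustration.

```r
set.seed(42)         # assumed seed, only for reproducibility
x <- rnorm(100)
trim <- 0.05

# Trimmed mean as computed by base R.
trimmed <- mean(x, trim = trim)

# The same value by hand: drop floor(n * trim) values from each end
# of the sorted data, then average what remains.
k <- floor(length(x) * trim)
xs <- sort(x)
manual <- mean(xs[(k + 1):(length(x) - k)])
c(trimmed = trimmed, manual = manual)

# Quartiles under R's default definition (type = 7).
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75), type = 7)
quartiles
```

With 100 observations and `trim = 0.05`, five values are dropped from each tail before averaging.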
Salaries for engineering graduates 10 years after graduation
A data frame with 51 observations on the following 2 variables:
salary 10 years after graduation in thousands of dollars
one of three different engineering universities
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
boxplot(salary ~ university, data = Engineer, horizontal = TRUE)
# Trellis Approach
bwplot(university ~ salary, data = Engineer)
Initial results from a study to determine whether the traditional sitting position or the hamstring stretch position is superior for administering epidural anesthesia to pregnant women in labor as measured by the number of obstructive (needle to bone) contacts (OC)
A data frame with 85 observations on the following 7 variables:
a factor with levels Dr. A, Dr. B, Dr. C, and Dr. D
weight in kg of patient
height in cm of patient
a factor with levels Difficult, Easy, and Impossible indicating the physician's assessment of how well bone landmarks can be felt in the patient
a factor with levels Hamstring Stretch and Traditional Sitting
number of obstructive contacts
a factor with levels Failure - person got dizzy, Failure - too many OCs, None, Paresthesia, and Wet Tap
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
EPIDURAL$Teasy <- factor(EPIDURAL$Ease, levels = c("Easy", "Difficult", "Impossible"))
X <- table(EPIDURAL$Doctor, EPIDURAL$Teasy)
X
par(mfrow = c(2, 2)) # Figure 2.12
barplot(X, main = "Barplot where Doctor is Stacked \n within Levels of Palpitation")
barplot(t(X), main = "Barplot where Levels of Palpitation \n is Stacked within Doctor")
barplot(X, beside = TRUE, main = "Barplot where Doctor is Grouped \n within Levels of Palpitation")
barplot(t(X), beside = TRUE, main = "Barplot where Levels of Palpitation \n is Grouped within Doctor")
par(mfrow = c(1, 1))
rm(X)
Intermediate results from a study to determine whether the traditional sitting position or the hamstring stretch position is superior for administering epidural anesthesia to pregnant women in labor as measured by the number of obstructive (needle to bone) contacts (OC)
A data frame with 342 observations on the following 7 variables:
a factor with levels Dr. A, Dr. B, Dr. C, and Dr. D
weight in kg of patient
height in cm of patient
a factor with levels Difficult, Easy, and Impossible indicating the physician's assessment of how well bone landmarks can be felt in the patient
a factor with levels Hamstring Stretch and Traditional Sitting
number of obstructive contacts
a factor with levels Failure - person got dizzy, Failure - too many OCs, None, Paresthesia, and Wet Tap
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
boxplot(OC ~ Treatment, data = EPIDURALf)
A random sample of 15 countries' research and development investments for the years 2002 and 2003 is taken and the results in millions of euros are stored in EURD.
A data frame with 15 observations on the following 3 variables:
a factor with levels Bulgaria, Croatia, Cyprus, Czech Republic, Estonia, France, Hungary, Latvia, Lithuania, Malta, Portugal, Romania, Slovakia, and Slovenia
research and development investments in millions of euros for 2002
research and development investments in millions of euros for 2003
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
qqmath(~(RD2003 - RD2002), data = EURD, type=c("p", "r"))
The carbon retained by leaves, measured in kg/ha, is recorded for forty-one different plots in mountainous regions of Navarra (Spain), classified by forest type: areas with 90% or more beech trees (Fagus sylvatica) are labeled monospecific, while areas with many species of trees are labeled multispecific.
A data frame with 41 observations on the following 3 variables:
plot number
carbon retained by leaves measured in kg/ha
a factor with levels monospecific and multispecific
Gobierno de Navarra and Gestion Ambiental Viveros y Repoblaciones de Navarra, 2006. The data were obtained within the European Project FORSEE.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
boxplot(carbon ~ type, data=fagus)
In a weight loss study on obese cats, overweight cats were randomly assigned to one of three groups and boarded in a kennel. In each of the three groups, the cats' total caloric intake was strictly controlled (1 cup of generic cat food) and monitored for 10 days. The difference between the groups was that group A was given 1/4 of a cup of cat food every six hours, group B was given 1/3 of a cup of cat food every eight hours, and group C was given 1/2 of a cup of cat food every twelve hours. The weight of the cats at the beginning and end of the study was recorded and the difference in weights (grams) is stored in the variable Weight of the data frame FCD. Data are used in Example 11.4.
A data frame with 36 observations on the following 2 variables:
difference in weights (grams)
a factor with levels A, B, and C
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
# Figure 11.12
FCD.aov <- aov(Weight ~ Diet, data = FCD)
checking.plots(FCD.aov)
rm(FCD.aov)
Plants' heights in inches obtained from two seeds, one obtained by cross fertilization and the other by auto fertilization, in two opposite but separate locations of a pot are recorded.
A data frame with 15 observations on the following 2 variables:
height of plant in inches
height of plant in inches
Darwin, C. (1876) The Effect of Cross and Self-Fertilization in the Vegetable Kingdom
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Fertilize, t.test(cross, self))
Shear measured in kN on frozen carrots from four randomly selected freezers
A data frame with 16 observations on the following 2 variables:
carrot shear measured in kN
a factor with levels A, B, C, and D
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
summary(aov(shear ~ freezer, data = food))
Pit stop times for two teams at 10 randomly selected Formula 1 races
A data frame with 10 observations on the following 3 variables:
number corresponding to a race site
pit stop times for team one
pit stop times for team two
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Formula1, boxplot(Team1, Team2))
Contains time until failure in hours for a particular electronic component subjected to an accelerated stress test.
A data frame with 100 observations on the following variable:
times until failure in hours
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = GD, hist(attf, prob = TRUE))
with(data = GD, lines(density(attf)))
# Trellis Approach
histogram(~attf, data = GD, type = "density",
          panel = function(x, ...) {
            panel.histogram(x, ...)
            panel.densityplot(x, col = "blue", plot.points = TRUE, lwd = 2)
          })
Fifteen diabetic patients were randomly selected, and their blood glucose levels were measured in mg/100 ml with two different devices.
A data frame with 15 observations on the following 3 variables:
patient number
blood glucose level in mg/100 ml using old device
blood glucose level in mg/100 ml using new device
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = glucose, boxplot(Old, New))
The admissions committee of a comprehensive state university selected at random the records of 200 second semester freshmen. The results, first semester college GPA and SAT scores, are stored in the data frame Grades. Data are used in Example 12.6.
A data frame with 200 observations on the following 2 variables:
SAT score
grade point average
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
# traditional scatterplot
plot(gpa ~ sat, data = Grades)
# trellis scatterplot
xyplot(gpa ~ sat, data = Grades, type = c("p", "smooth"))
The consumer expenditure survey, created by the U.S. Department of Labor, was administered to 30 households in Watauga County, North Carolina, to see how the cost of living in Watauga County, with respect to total dollars spent on groceries, compares with other counties. The amount of money each household spent per week on groceries is stored in the variable groceries. Data are used in Example 8.3.
A data frame with 30 observations on the following variable:
total dollars spent on groceries
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Grocery, z.test(x = groceries, sigma.x = 30, conf.level = 0.97)$conf)
Mortality and drinking water hardness for 61 cities in England and Wales.
A data frame with 61 observations on the following 4 variables.
a factor with levels North and South indicating whether the town is as north as Derby
the name of the town
averaged annual mortality per 100,000 males
calcium concentration (in parts per million)
These data were collected in an investigation of environmental causes of disease. They show the annual mortality rate per 100,000 for males, averaged over the years 1958-1964, and the calcium concentration (in parts per million) in the drinking water supply for 61 large towns in England and Wales. (The higher the calcium concentration, the harder the water.)
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway and E. Ostrowski (1994) A Handbook of Small Datasets. Chapman and Hall/CRC, London.
plot(mortality ~ hardness, data = HardWater)
Random sample of house prices (in thousands of dollars) for three bedroom/two bath houses in Watauga County, NC
A data frame with 14 observations on the following 2 variables:
a factor with levels Blowing Rock, Cove Creek, Green Valley, Park Valley, Parkway, and Valley Crucis
price of house (in thousands of dollars)
with(data = House, t.test(Price))
The body fat of 78 high school wrestlers was measured using three separate techniques, and the results are stored in the data frame HSwrestler. The techniques used were hydrostatic weighing (HWFAT), skin fold measurements (SKFAT), and the Tanita body fat scale (TANFAT). Data are used in Examples 10.11, 12.11, and 12.12.
A data frame with 78 observations on the following 9 variables:
age of wrestler in years
height of wrestler in inches
weight of wrestler in pounds
abdominal fat
tricep fat
subscapular fat
hydrostatic fat
Tanita fat
skin fat
Data provided by Dr. Alan Utter, Department of Health Leisure and Exercise Science, Appalachian State University.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
FAT <- c(HSwrestler$HWFAT, HSwrestler$TANFAT, HSwrestler$SKFAT)
GROUP <- factor(rep(c("HWFAT", "TANFAT", "SKFAT"), rep(78, 3)))
BLOCK <- factor(rep(1:78, 3))
friedman.test(FAT ~ GROUP | BLOCK)
The Hubble Space Telescope was put into orbit on April 25, 1990. Unfortunately, on June 25, 1990, a spherical aberration was discovered in Hubble's primary mirror. To correct this, astronauts had to work in space. To prepare for the mission, two teams of astronauts practiced making repairs under simulated space conditions. Each team of astronauts went through 15 identical scenarios. The times to complete each scenario were recorded in days.
A data frame with 15 observations on the following 2 variables:
Team1
days to complete scenario
Team2
days to complete scenario
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Hubble, qqnorm(Team1 - Team2))
with(data = Hubble, qqline(Team1 - Team2))
# Trellis Approach
qqmath(~(Team1 - Team2), data = Hubble, type = c("p", "r"))
Insurance quotes for two insurers of hazardous waste jobs
A data frame with 15 observations on the following 2 variables:
companyA
quotes from company A in euros
companyB
quotes from company B in euros
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = InsurQuotes, t.test(companyA, companyB))
Function to graph intervals
interval.plot(ll, ul, parameter = 0)
ll | vector of lower values |
ul | vector of upper values |
parameter | value of the desired parameter (used when graphing confidence intervals) |
Draws user-given intervals on a graphical device.
Alan T. Arnholt <[email protected]>
set.seed(385)
samples <- 100
n <- 625
ll <- numeric(samples)
ul <- numeric(samples)
xbar <- numeric(samples)
for (i in 1:samples){
  xbar[i] <- mean(rnorm(n, 80, 25))
  ll[i] <- xbar[i] - qnorm(.975)*25/sqrt(n)
  ul[i] <- xbar[i] + qnorm(.975)*25/sqrt(n)
}
interval.plot(ll, ul, parameter = 80)
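The simulation in this example builds 100 nominal 95% confidence intervals for a mean of 80. As a rough companion that needs no graphics device, the sketch below rebuilds the same kind of intervals in vectorized base R and counts how many contain the true mean. The names mu, sigma, half_width, and covered are introduced here for illustration only and are not part of interval.plot().

```r
set.seed(385)
samples <- 100
n <- 625
mu <- 80       # true mean used in the simulation
sigma <- 25    # known population standard deviation

# One sample mean per simulated sample
xbar <- replicate(samples, mean(rnorm(n, mu, sigma)))

# 95% z intervals: xbar +/- z * sigma / sqrt(n)
half_width <- qnorm(0.975) * sigma / sqrt(n)
ll <- xbar - half_width
ul <- xbar + half_width

# TRUE where the interval contains the true mean
covered <- ll <= mu & mu <= ul
sum(covered)
```

In repeated runs with different seeds, the count should be close to 95 out of 100, which is what the plotted intervals illustrate visually.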
The dataset consists of density and hardness measurements from 36 Australian Eucalypt hardwoods.
A data frame with 36 observations on the following 2 variables.
Density
a measure of density of the timber
Hardness
the Janka hardness of the timber
Janka Hardness is an important rating of Australian hardwood timbers. The test measures the force required to embed a steel ball into a piece of wood.
Williams, E.J. (1959) Regression Analysis. John Wiley & Sons, New York.
with(data = janka, plot(Hardness ~ Density, col = "blue"))
The data frame Kinder contains the height in inches and weight in pounds of 20 children from a kindergarten class. Data are used in Example 12.17.
A data frame with 20 observations on the following 2 variables:
ht
height in inches of child
wt
weight in pounds of child
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
# Figure 12.10
with(data = Kinder, plot(wt, ht))
# Trellis Approach
xyplot(ht ~ wt, data = Kinder)
Function to visualize the sampling distribution of the Kolmogorov-Smirnov one-sample statistic and to find simulated critical values.
ksdist(n = 10, sims = 10000, alpha = 0.05)
n | sample size |
sims | number of simulations to perform |
alpha | desired alpha level |
Alan T. Arnholt <[email protected]>
ksdist(n = 10, sims = 15000, alpha =0.05)
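To make the idea of a simulated critical value concrete, here is a minimal base-R sketch that simulates the one-sample Kolmogorov-Smirnov statistic for uniform samples under the null hypothesis and takes an upper quantile. It illustrates the general technique only. Whether ksdist() uses this exact scheme is an assumption, and the names Dn and crit are introduced here.

```r
set.seed(1)
n <- 10         # sample size
sims <- 5000    # number of simulated samples
alpha <- 0.05   # upper-tail probability for the critical value

i <- seq_len(n)
# KS statistic for a sorted U(0, 1) sample u:
# the largest of i/n - u[i] and u[i] - (i - 1)/n over all i
Dn <- replicate(sims, {
  u <- sort(runif(n))
  max(pmax(i / n - u, u - (i - 1) / n))
})

# Simulated critical value: the (1 - alpha) quantile of the simulated statistics
crit <- unname(quantile(Dn, probs = 1 - alpha))
crit

# hist(Dn) would show the simulated sampling distribution
```

Because the statistic is distribution free for a fully specified continuous null distribution, simulating from U(0, 1) is enough for the simple-hypothesis case.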
Function to visualize the sampling distribution of the Kolmogorov-Smirnov one-sample statistic for simple and composite hypotheses
ksLdist(n = 10, sims = 10000, alpha = 0.05)
n | sample size |
sims | number of simulations to perform |
alpha | desired alpha level |
Alan T. Arnholt <[email protected]>
ksLdist(n = 10, sims = 1500, alpha = 0.05)
The diameter in millimeters for a random sample of 15 diodes from each of the two suppliers is stored in the data frame Leddiode.
A data frame with 15 observations on the following 2 variables:
supplierA
diameter in millimeters of diodes from supplier A
supplierB
diameter in millimeters of diodes from supplier B
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Leddiode, boxplot(supplierA, supplierB, col = c("red", "blue")))
Data set containing the lost revenue in dollars/day and number of workers absent due to illness for a metallurgic company
A data frame with 25 observations on the following 2 variables:
NumberSick
number of absent workers due to illness
LostRevenue
lost revenue in dollars
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
xyplot(LostRevenue ~ NumberSick, data = LostR, type=c("p", "r"))
A plastics manufacturer makes two sizes of milk containers: half gallon and gallon sizes. The time required for each size to dry is recorded in seconds in the data frame MilkCarton.
A data frame with 40 observations on the following 2 variables:
Hgallon
drying time in seconds for half gallon containers
Wgallon
drying time in seconds for whole gallon containers
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = MilkCarton, boxplot(Hgallon, Wgallon))
Function that computes and draws the area between two user specified values in a user specified normal distribution with a given mean and standard deviation
normarea(lower = -Inf, upper = Inf, m = 0, sig = 1)
lower | the desired lower value |
upper | the desired upper value |
m | the mean for the population (default is the standard normal with mean 0) |
sig | the standard deviation of the population (default is the standard normal with standard deviation 1) |
Draws the specified area in a graphics device
Alan T. Arnholt <[email protected]>
# Finds and graphically illustrates P(70 < X < 130) given X is N(100, 15)
normarea(lower = 70, upper = 130, m = 100, sig = 15)
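The area that this example shades can be cross-checked numerically with base R alone. The short sketch below computes P(70 < X < 130) for a normal distribution with mean 100 and standard deviation 15 using pnorm(). It does not draw anything, and the variable names simply mirror the arguments of normarea().

```r
lower <- 70
upper <- 130
m <- 100     # population mean
sig <- 15    # population standard deviation

# Area between lower and upper under the N(m, sig) density
area <- pnorm(upper, mean = m, sd = sig) - pnorm(lower, mean = m, sd = sig)
area
```

Both limits sit two standard deviations from the mean, so the result is the familiar value of roughly 0.954.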
Function to determine required sample size to be within a given margin of error
nsize(b, sigma = NULL, p = 0.5, conf.level = 0.95, type = c("mu", "pi"))
b | the desired bound |
sigma | population standard deviation; not required if using type "pi" |
p | estimate for the population proportion of successes; not required if using type "mu" |
conf.level | confidence level for the problem, restricted to lie between zero and one |
type | character string, one of "mu" or "pi" |
Answer is based on a normal approximation when using type "pi".
Alan T. Arnholt <[email protected]>
nsize(b = 0.015, p = 0.5, conf.level = 0.95, type = "pi")
# Returns the required sample size (n) to estimate the population
# proportion of successes with a 0.95 confidence interval
# so that the margin of error is no more than 0.015 when the
# estimate of the population proportion of successes is 0.5.
nsize(b = 0.02, sigma = 0.1, conf.level = 0.95, type = "mu")
# Returns the required sample size (n) to estimate the population
# mean with a 0.95 confidence interval so that the margin
# of error is no more than 0.02.
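For readers who want to see the arithmetic behind a margin-of-error calculation, the sketch below applies the standard normal-approximation sample size formulas to the same inputs as the two calls in this example. That nsize() uses these formulas and rounds up in the same way is an assumption. The names z, n_pi, and n_mu are introduced here.

```r
conf.level <- 0.95
z <- qnorm(1 - (1 - conf.level) / 2)   # two-sided critical value

# Proportion: smallest n with z * sqrt(p * (1 - p) / n) <= b
p <- 0.5
b_pi <- 0.015
n_pi <- ceiling(p * (1 - p) * (z / b_pi)^2)

# Mean with known sigma: smallest n with z * sigma / sqrt(n) <= b
sigma <- 0.1
b_mu <- 0.02
n_mu <- ceiling((z * sigma / b_mu)^2)

c(n_pi = n_pi, n_mu = n_mu)
```

Rounding up with ceiling() keeps the margin of error at or below the bound b, since any smaller integer would exceed it.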
Q-Q plots of randomly generated normal data of the same sample size as the tested data are generated and plotted on the perimeter of the graph while a Q-Q plot of the actual data is depicted in the center of the graph.
ntester(actual.data)
actual.data | a numeric vector. Missing and infinite values are allowed, but are ignored in the calculation. The length of actual.data must be less than or equal to 5000. |
Q-Q plots of randomly generated normal data of the same size as the tested data are generated and plotted on the perimeter of the graph sheet while a Q-Q plot of the actual data is depicted in the center of the graph. The p-values are calculated based on the Shapiro-Wilk W-statistic. Function will only work on numeric vectors containing less than or equal to 5000 observations. Best used for moderate sized samples (n < 50).
Alan T. Arnholt <[email protected]>
Shapiro, S.S. and Wilk, M.B. 1965. An analysis of variance test for normality (complete samples). Biometrika 52: 591-611.
ntester(actual.data = rexp(40, 1))
# Q-Q plot of random exponential data in center plot
# surrounded by 8 Q-Q plots of randomly generated
# standard normal data of size 40.
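The two numerical ingredients described here, the Shapiro-Wilk p-value and the normal Q-Q coordinates, are both available in base R. The sketch below computes them for one simulated exponential sample without drawing the nine-panel display. It is only an illustration of those ingredients, not a reimplementation of ntester(), and the object names are introduced here.

```r
set.seed(42)
x <- rexp(40, 1)                      # skewed data, as in the example

# Shapiro-Wilk test of normality (base R accepts 3 to 5000 observations)
sw <- shapiro.test(x)
sw$statistic
sw$p.value

# Coordinates of the normal Q-Q plot, computed without plotting
qq <- qqnorm(x, plot.it = FALSE)
head(cbind(theoretical = qq$x, sample = qq$y))
```

A small p-value together with a curved Q-Q pattern points away from normality. Comparing against Q-Q plots of truly normal samples of the same size, as ntester() does, helps calibrate how much curvature is ordinary.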
Function to create dotplots, boxplots, and design plot (means) for single factor designs
oneway.plots(Y, fac1, COL = c("#A9E2FF", "#0080FF"))
Y | response variable for a single factor design |
fac1 | predictor variable (factor) |
COL | a vector with two colors |
Alan T. Arnholt <[email protected]>
with(data = Tire, oneway.plots(StopDist, tire))
The data frame Phenyl records the level of Q10 at four different times for 46 patients diagnosed with phenylketonuria. The variable Q10.1 contains the level of Q10 measured in micromoles for the 46 patients. Q10.2, Q10.3, and Q10.4 are the values recorded at later times, respectively, for the 46 patients.
A data frame with 46 observations on the following 4 variables.
Q10.1
level of Q10 at time 1 in micromoles
Q10.2
level of Q10 at time 2 in micromoles
Q10.3
level of Q10 at time 3 in micromoles
Q10.4
level of Q10 at time 4 in micromoles
Phenylketonuria (PKU) is a genetic disorder that is characterized by an inability of the body to utilize the essential amino acid, phenylalanine. Research suggests patients with phenylketonuria have deficiencies in coenzyme Q10.
Artuch, R., et al. (2004) “Study of Antioxidant Status in Phenylketonuric Patients.” Clinical Biochemistry, 37: 198-203.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Phenyl, t.test(Q10.1, conf.level = 0.99))
Phone contains times in minutes of long distance telephone calls during a one month period for a small business. Data are used in Example 10.1.
A data frame with 23 observations on the following variable:
call.time
time spent on long distance calls in minutes
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Phone, SIGN.test(call.time, md = 2.1))
The survival time in weeks of 20 male rats exposed to high levels of radiation.
A data frame with 20 observations on the following variable:
survival.time
number of weeks survived
Lawless, J. (1982) Statistical Models and Methods for Lifetime Data. John Wiley, New York.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Rat, EDA(survival.time))
Twelve rats were chosen, and a drug was administered to six rats, the treatment group, chosen at random. The other six rats, the control group, received a placebo. The drops in blood pressure (mmHg) for the treatment group (with probability distribution F) and the control group (with probability distribution G) are stored in the variables Treat and Cont, respectively. Data are used in Example 10.18.
A data frame with 6 observations on the following 2 variables:
Treat
drops in blood pressure in mmHg for treatment group
Cont
drops in blood pressure in mmHg for control group
The data is originally from Ott and Mendenhall (1985, problem 8.17).
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Ratbp, boxplot(Treat, Cont))
Thirty 18-cubic-foot refrigerators were randomly selected from a company's warehouse. The first fifteen had their motors modified while the last fifteen were left intact. The energy consumption (kilowatts) for a 24 hour period for each refrigerator was recorded and stored in the data frame Refrigerator. The refrigerators with the design modification are stored in the variable modelA, and those without the design modification are stored in the variable modelB.
A data frame with 30 observations on the following 2 variables.
modelA
energy consumption in kilowatts for a 24 hour period
modelB
energy consumption in kilowatts for a 24 hour period
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Refrigerator, boxplot(modelA, modelB))
A laboratory is interested in testing a new child-friendly pesticide on Blatta orientalis (oriental cockroaches). Scientists apply the new pesticide to 81 randomly selected Blatta orientalis oothecae (eggs). The results from the experiment are stored in the data frame Roacheggs in the variable eggs. A zero in the variable eggs indicates that nothing hatched from the egg while a 1 indicates the birth of a cockroach. Data are used in Example 7.16.
A data frame with 81 observations on the following variable:
eggs
numeric vector where a 0 indicates nothing hatched while a 1 indicates the birth of a cockroach.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
p <- seq(0.1, 0.9, 0.001)
negloglike <- function(p){
  -(sum(Roacheggs$eggs)*log(p) + sum(1 - Roacheggs$eggs)*log(1 - p))
}
nlm(negloglike, 0.2)
rm(negloglike)
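The example minimizes a Bernoulli negative log-likelihood numerically. For 0/1 data the maximum likelihood estimate also has a closed form, the sample proportion of ones, which makes a convenient check on the optimizer. The sketch below uses a small hypothetical 0/1 vector (not the Roacheggs data) and base R's optimize() in place of nlm(); that substitution is made here only to keep the search inside (0, 1).

```r
# Hypothetical 0/1 outcomes (NOT the Roacheggs data): 12 ones and 28 zeros
eggs <- c(rep(1, 12), rep(0, 28))

# Bernoulli negative log-likelihood as a function of p
negloglike <- function(p) {
  -(sum(eggs) * log(p) + sum(1 - eggs) * log(1 - p))
}

# Numerical minimization restricted to the open unit interval
fit <- optimize(negloglike, interval = c(0.001, 0.999))

# Closed-form MLE: the sample proportion of ones
p_hat <- mean(eggs)

c(numerical = fit$minimum, closed_form = p_hat)
```

The two values should agree up to the tolerance of the optimizer.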
Surface-water salinity measurements were taken in a bottom-sampling project in Whitewater Bay, Florida. These data are stored in the data frame Salinity.
A data frame with 48 observations on the following variable:
salinity
surface-water salinity measurements
Davis, J. (1986) Statistics and Data Analysis in Geology. John Wiley, New York.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Salinity, EDA(salinity))
To estimate the total surface occupied by fruit trees in 3 small areas (R63, R67, and R68) of Navarra (Spain) in 2001, a sample of 47 square segments has been taken. The experimental units are square segments or quadrats of 4 hectares, obtained by random sampling after overlaying a square grid on the study domain. Data are used in Case Study: Fruit Trees, Chapter 12.
A data frame with 47 observations on the following 17 variables:
number of the sampled segment or quadrat
the small area, a factor with levels R63, R67, and R68
area classified as wheat in sampled segment
area classified as barley in sampled segment
area classified as non arable in sampled segment
area classified as corn in sampled segment
area classified as sunflower in sampled segment
area classified as vineyard in sampled segment
area classified as grass in sampled segment
area classified as asparagus in sampled segment
area classified as lucerne in sampled segment
area classified as rape (Brassica napus) in sampled segment
area classified as rice in sampled segment
area classified as almonds in sampled segment
area classified as olives in sampled segment
area classified as fruit trees in sampled segment
the observed area of fruit trees in sampled segment
Militino, A. F., et al. (2006) “Using Small Area Models to Estimate the Total Area Occupied by Olive Trees.” Journal of Agricultural, Biological and Environmental Statistics, 11: 450-461.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = satfruit, pairs(satfruit[ , 15:17]))
# Trellis Approach
splom(~data.frame(satfruit[ , 15:17]), data = satfruit)
A school psychologist administered the Stanford-Binet intelligence quotient (IQ) test in two counties. Forty gifted and talented students were randomly selected from each county. The Stanford-Binet IQ test is said to follow a normal distribution with a mean of 100 and standard deviation of 16.
A data frame with 40 observations on the following 2 variables:
County1
IQ scores for county one
County2
IQ scores for county two
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = SBIQ, qqnorm(County1))
with(data = SBIQ, qqline(County1))
# Trellis Approach
qqmath(~County1, data = SBIQ, type = c("p", "r"))
Twenty-five patients with schizophrenia were classified as psychotic or nonpsychotic after being treated with an antipsychotic drug. Samples of cerebral fluid were taken from each patient and assayed for dopamine b-hydroxylase (DBH) activity. The dopamine measurements for the two groups are in nmol/(ml)(h)/(mg) of protein.
A data frame with 15 observations on the following 2 variables:
nonpsychotic
dopamine activity level for patients classified nonpsychotic
psychotic
dopamine activity level for patients classified psychotic
Sternberg, D. E., Van Kammen, D. P., and Bunney, W. E. (1982) “Schizophrenia: Dopamine b-Hydroxylase Activity and Treatment Response.” Science, 216: 1423-1425.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Schizo, boxplot(nonpsychotic, psychotic, names = c("nonpsychotic", "psychotic"), col = c("green", "red")))
Standardized test scores from a random sample of twenty college freshmen.
A data frame with 20 observations on the following variable:
scores
standardized test score
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
qqmath(~scores, data = Score, type=c("p", "r"))
The times recorded are those for 41 successive vehicles travelling northwards along the M1 motorway in England when passing a fixed point near Junction 13 in Bedfordshire on Saturday, March 23, 1985. After subtracting successive times, the resulting 40 interarrival times, reported to the nearest second, are stored in SDS4 under the variable Times. Data are used in Example 10.17.
A data frame with 40 observations on the following variable:
Times
interarrival times to the nearest second
Hand, D. J., et al. (1994) A Handbook of Small Data Sets. Chapman & Hall, London.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = SDS4, hist(Times))
This function tests a hypothesis based on the sign test and reports linearly interpolated confidence intervals for one-sample problems.
SIGN.test(x, y = NULL, md = 0, alternative = "two.sided", conf.level = 0.95, ...)
x | numeric vector |
y | optional numeric vector |
md | a single number representing the value of the population median specified by the null hypothesis |
alternative | a character string, one of "greater", "less", or "two.sided" |
conf.level | confidence level for the returned confidence interval, restricted to lie between zero and one |
... | further arguments to be passed to or from methods |
Computes a “Dependent-samples Sign-Test” if both x and y are provided. If only x is provided, computes the “Sign-Test.”
A list of class htest_S, containing the following components:
statistic | the S-statistic (the number of positive differences between the data and the hypothesized median), with names attribute “S” |
p.value | the p-value for the test |
conf.int | a confidence interval (vector of length 2) for the true median based on linear interpolation. The confidence level is recorded in the attribute conf.level. |
estimate | a vector of length 1, giving the sample median; this estimates the corresponding population parameter |
null.value | the value of the median specified by the null hypothesis. This equals the input argument md. |
alternative | records the value of the input argument alternative: "greater", "less", or "two.sided" |
data.name | a character string (vector of length 1) containing the actual name of the input vector x |
Confidence.Intervals | a 3 by 3 matrix containing the lower achieved confidence interval, the interpolated confidence interval, and the upper achieved confidence interval |
For the one-sample sign-test, the null hypothesis is that the median of the population from which x is drawn is md. For the two-sample dependent case, the null hypothesis is that the median for the differences of the populations from which x and y are drawn is md. The alternative hypothesis indicates the direction of divergence of the population median for x from md (i.e., "greater", "less", "two.sided").
The median test assumes the parent population is continuous.
The reported confidence interval is based on linear interpolation. The lower and upper confidence levels are exact.
Alan T. Arnholt <[email protected]>
Gibbons, J.D. and Chakraborti, S. 1992. Nonparametric Statistical Inference. Marcel Dekker Inc., New York.
Kitchens, L.J. 2003. Basic Statistics and Data Analysis. Duxbury.
Conover, W. J. 1980. Practical Nonparametric Statistics, 2nd ed. Wiley, New York.
Lehmann, E. L. 1975. Nonparametrics: Statistical Methods Based on Ranks. Holden and Day, San Francisco.
with(data = Phone, SIGN.test(call.time, md = 2.1))
# Computes two-sided sign-test for the null hypothesis
# that the population median is 2.1. The alternative
# hypothesis is that the median is not 2.1. An interpolated
# 95% confidence interval for the population median will be computed.
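The one-sample sign test reduces to a binomial test on the number of observations above the hypothesized median. The sketch below works through that reduction in base R on a small hypothetical vector (not the Phone data). Dropping observations equal to md before counting is a common convention assumed here. Whether SIGN.test() handles ties the same way is not stated in this documentation.

```r
# Hypothetical measurements (NOT the Phone data)
x <- c(0.2, 0.7, 1.4, 2.5, 3.1, 4.0, 5.6, 6.3, 7.8, 9.9)
md <- 2.1                     # hypothesized population median

d <- x - md
d <- d[d != 0]                # assumed convention: drop ties with md
S <- sum(d > 0)               # number of positive differences
n <- length(d)

# Under the null hypothesis S ~ Binomial(n, 0.5)
p_value <- binom.test(S, n, p = 0.5, alternative = "two.sided")$p.value

c(S = S, n = n, p_value = p_value)
```

Here 7 of the 10 values exceed 2.1, so the two-sided p-value is twice the upper binomial tail at 7, which is 0.34375.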
Simulated data for five variables. Data are used with Example 12.21.
A data frame with 200 observations on the following 5 variables:
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
xyplot(Y1 ~ x1, data = SimDataST, type=c("p", "smooth"))
Simulated data for four variables. Data are used with Example 12.18.
A data frame with 200 observations on the following 4 variables:
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
xyplot(Y ~ x1, data = SimDataXT, type=c("p", "smooth"))
Soccer contains how many goals were scored in the regulation 90 minute periods of World Cup soccer matches from 1990 to 2002. Data are used in Example 4.4.
A data frame with 575 observations on the following 3 variables:
CGT
cumulative goal time in minutes
Game
game in which goals were scored
Goals
number of goals scored in regulation period
The World Cup is played once every four years. National teams from all over the world compete. In 2002 and in 1998, thirty-two teams participated, whereas in 1994 and in 1990, only 24 teams participated. The data frame Soccer contains three columns: CGT, Game, and Goals. All of the information contained in Soccer is indirectly available from the FIFA World Cup website, located at https://www.fifa.com/.
Chu, S. (2003) “Using Soccer Goals to Motivate the Poisson Process.” INFORMS Transactions on Education, 3, 2: 62-68.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Soccer, table(Goals))
Computes all possible samples from a given population using simple random sampling
SRS(popvalues, n)
popvalues | values of the population |
n | the sample size |
If non-finite values are entered as part of the population, they are removed, and the returned simple random samples are based on the remaining finite values.
The function SRS() returns a matrix containing the possible simple random samples of size n taken from a population of finite values popvalues.
Alan T. Arnholt <[email protected]>
SRS(popvalues = c(5, 8, 3, NA, Inf), n = 2)
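Enumerating every simple random sample of size n from a finite population amounts to listing all combinations of n values after dropping the non-finite ones. The base-R sketch below shows that idea with combn(). It is not the package's implementation, and the row order and layout that SRS() actually returns are assumptions.

```r
popvalues <- c(5, 8, 3, NA, Inf)
n <- 2

# Keep finite values only (is.finite() is FALSE for NA, NaN, Inf, and -Inf)
finite_values <- popvalues[is.finite(popvalues)]

# combn() returns one combination per column; transpose for one sample per row
all_samples <- t(combn(finite_values, n))
all_samples

# The number of possible samples is choose(N, n)
nrow(all_samples) == choose(length(finite_values), n)
```

With three finite values and n = 2 there are choose(3, 2) = 3 possible samples.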
In a study conducted at Appalachian State University, students used digital oral thermometers to record their temperatures each day they came to class. A randomly selected day of student temperatures is provided in StatTemps. Information is also provided with regard to subject gender and the hour of the day when the students' temperatures were measured.
A data frame with 34 observations on the following 3 variables:
temperature in degrees Fahrenheit
a factor with levels Female and Male
a factor with levels 8 a.m. and 9 a.m.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
bwplot(gender ~ temperature, data = StatTemps)
A questionnaire is randomly administered to 11 students from State School X and to 15 students from State School Y (the results have been ordered and stored in the data frame Stschool). Data are used in Example 9.11.
A data frame with 26 observations on the following 4 variables:
satisfaction score
satisfaction score
combined satisfaction scores
a factor with levels X and Y
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Stschool, t.test(X, Y, var.equal = TRUE))
To compare the speed differences between two different brands of workstations (Sun and Digital), the times each brand took to complete complex simulations were recorded. Five complex simulations were selected, and the five selected simulations were run on both workstations. The resulting times in minutes for the five simulations are stored in data frame Sundig.
A data frame with 5 observations on the following 3 variables:
time in seconds for a Sun workstation to complete a simulation
time in seconds for a Digital workstation to complete a simulation
difference between Sun and Digital
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Sundig, t.test(SUN, DIGITAL, paired = TRUE)$conf)
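A paired t interval is the same as a one-sample t interval computed on the within-pair differences, which is why a difference column is useful for data of this kind. The sketch below checks that equivalence in base R. The two vectors are hypothetical timings invented for illustration and are not the Sundig data.

```r
# Hypothetical paired timings (NOT the Sundig data)
sun <- c(45, 52, 38, 61, 49)
digital <- c(47, 50, 41, 65, 53)

# Paired analysis on the two columns
paired <- t.test(sun, digital, paired = TRUE)

# One-sample analysis on the differences
one_sample <- t.test(sun - digital)

paired$conf.int
one_sample$conf.int
```

Both calls report the same interval and the same t statistic, because pairing removes the simulation-to-simulation variability before the test is carried out.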
Seventy-two field trials were conducted by applying four defoliation treatments (non defoliated control, 33%, 66%, and 100%) at different growth stages (stage) ranging from pre-flowering (1) to physiological maturity (5) in four different locations of Navarra, Spain: Carcastillo (1), Melida (2), Murillo (3), and Unciti (4). There are two response variables: yield in kg/ha of the sunflower and numseed, the number of seeds per sunflower head. Data are stored in the data frame sunflower. Data are used in Case Study: Sunflower defoliation from Chapter 11.
A data frame with 72 observations on the following 5 variables:
a factor with levels A, B, C, and D for locations Carcastillo, Melida, Murillo, and Unciti, respectively
a factor with levels stage1, stage2, stage3, stage4, and stage5
a factor with levels control, treat1, treat2, and treat3
sunflower yield in kg/ha
number of seeds per sunflower head
Muro, J., et al. (2001) “Defoliation Effects on Sunflower Yield Reduction.” Agronomy Journal, 93: 634-637.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
summary(aov(yield ~ stage + defoli + stage:defoli, data = sunflower))
Surface area (km²) for seventeen autonomous Spanish communities.
A data frame with 17 observations on the following 2 variables:
community
a factor with levels Andalucia, Aragon, Asturias, Baleares, C.Valenciana, Canarias, Cantabria, Castilla-La Mancha, Castilla-Leon, Cataluna, Extremadura, Galicia, La Rioja, Madrid, Murcia, Navarra, and P.Vasco
surface
surface area in km²
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = SurfaceSpain, barplot(surface, names.arg = community, las = 2))
# Trellis Approach
barchart(community ~ surface, data = SurfaceSpain)
Swimmers' improvements in seconds for two diets are stored in the data frame Swimtimes. The values in highfat represent the time improvement in seconds for swimmers on a high fat diet, and the values in lowfat represent the time improvement in seconds for swimmers on a low fat diet. Data are used in Example 10.9.
A data frame with 14 observations on the following 2 variables:
time improvement in seconds
time improvement in seconds
Times for the thirty-two swimmers for the 200 yard individual medley were taken right after the swimmers' conference meet. The swimmers were randomly assigned to follow one of the diets. The group on diet 1 followed a low fat diet the entire year but lost two swimmers along the way. The group on diet 2 followed the high fat diet the entire year and also lost two swimmers.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Swimtimes, wilcox.test(highfat, lowfat))
The Yonalasee tennis club has two systems to measure the speed of a tennis ball. The local tennis pro suspects one system (Speed1) consistently records faster speeds. To test her suspicions, she sets up both systems and records the speeds of 12 serves (three serves from each side of the court). The values are stored in the data frame Tennis in the variables Speed1 and Speed2. The recorded speeds are in kilometers per hour.
A data frame with 12 observations on the following 2 variables:
speed in kilometers per hour
speed in kilometers per hour
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Tennis, boxplot(Speed1, Speed2))
Test grades of 29 students taking a basic statistics course
A data frame with 29 observations on the following variable:
test score
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = TestScores, EDA(grade))
The data frame Tire has the stopping distances measured to the nearest foot for a standard sized car to come to a complete stop from a speed of sixty miles per hour. There are six measurements of the stopping distance for four different tread patterns labeled A, B, C, and D. The same driver and car were used for all twenty-four measurements. Data are used in Examples 11.1 and 11.2.
A data frame with 24 observations on the following 2 variables:
stopping distance measured to the nearest foot
a factor with levels A, B, C, and D
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
summary(aov(StopDist ~ tire, data = Tire))
The data frame TireWear contains measurements for the amount of tread loss after 10,000 miles of driving in thousandths of an inch. Data are used in Example 11.8.
A data frame with 16 observations on the following 3 variables:
tread loss measured in thousandths of an inch
a factor with levels A, B, C, and D
a factor with levels Car1, Car2, Car3, and Car4
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
par(mfrow = c(1, 2), cex = 0.8)
with(data = TireWear, interaction.plot(Treat, Block, Wear, type = "b", legend = FALSE))
with(data = TireWear, interaction.plot(Block, Treat, Wear, type = "b", legend = FALSE))
par(mfrow = c(1, 1), cex = 1)
The titanic3 data frame describes the survival status of individual passengers on the Titanic. The titanic3 data frame does not contain information for the crew, but it does contain actual and estimated ages for almost 80% of the passengers.
A data frame with 1309 observations on the following 14 variables:
a factor with levels 1st, 2nd, and 3rd
Survival (0 = No; 1 = Yes)
Name
a factor with levels female and male
age in years
Number of Siblings/Spouses Aboard
Number of Parents/Children Aboard
Ticket Number
Passenger Fare
Cabin
a factor with levels Cherbourg, Queenstown, and Southampton
Lifeboat
Body Identification Number
Home/Destination
Thomas Cason of UVa has greatly updated and improved the titanic data frame using the Encyclopedia Titanica and created a new dataset called titanic3. This dataset reflects the state of data available as of August 2, 1999. Some duplicate passengers have been dropped, many errors have been corrected, many missing ages have been filled in, and new variables have been created.
https://hbiostat.org/data/repo/titanic.html
Harrell, F. E. (2001) Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer.
with(titanic3, table(pclass, sex))
Nuclear energy (in TOE, tons of oil equivalent) produced in 12 randomly selected European countries during 2003
A data frame with 12 observations on the following variable:
nuclear energy measured in tons of oil equivalent
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(TOE, plot(density(energy)))
Top20 contains data (in millions of dollars) corresponding to the earnings of 15 randomly selected tennis players whose earnings fall somewhere in positions 20 through 100 of ranked earnings.
A data frame with 15 observations on the following variable:
yearly income in millions of dollars
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Top20, EDA(income))
Performs a one-sample, two-sample, or a Welch modified two-sample t-test based on user supplied summary information. Output is identical to that produced with t.test.
tsum.test(
  mean.x, s.x = NULL, n.x = NULL,
  mean.y = NULL, s.y = NULL, n.y = NULL,
  alternative = c("two.sided", "less", "greater"),
  mu = 0, var.equal = FALSE, conf.level = 0.95, ...
)
mean.x | a single number representing the sample mean of x |
s.x | a single number representing the sample standard deviation of x |
n.x | a single number representing the sample size of x |
mean.y | a single number representing the sample mean of y |
s.y | a single number representing the sample standard deviation of y |
n.y | a single number representing the sample size of y |
alternative | is a character string, one of "two.sided", "less", or "greater" |
mu | is a single number representing the value of the mean or difference in means specified by the null hypothesis. |
var.equal | logical flag: if TRUE, the two population variances are assumed equal and the standard two-sample t-test is used; if FALSE, the Welch modification is used |
conf.level | is the confidence level for the returned confidence interval; it must lie between zero and one. |
... | Other arguments passed onto |
If mean.y is NULL, a one-sample t-test is carried out with the summary statistics for x. If mean.y is not NULL, either a standard or Welch modified two-sample t-test is performed, depending on whether var.equal is TRUE or FALSE.
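As a rough illustration of how a two-sample t-test can be computed from summary statistics alone, the sketch below uses only base R. The helper t_from_summary is hypothetical and is not the PASWR implementation; the x and y vectors are borrowed from the z.test examples in this manual purely to check the arithmetic against stats::t.test.

```r
# Hypothetical helper (not part of PASWR): t statistic, degrees of freedom,
# and two-sided p-value for a two-sample t-test from summary statistics.
t_from_summary <- function(mean.x, s.x, n.x, mean.y, s.y, n.y,
                           mu = 0, var.equal = FALSE) {
  if (var.equal) {
    # Standard test: pooled variance, n.x + n.y - 2 degrees of freedom
    df <- n.x + n.y - 2
    sp2 <- ((n.x - 1) * s.x^2 + (n.y - 1) * s.y^2) / df
    se <- sqrt(sp2 * (1 / n.x + 1 / n.y))
  } else {
    # Welch modification: separate variances, Satterthwaite degrees of freedom
    vx <- s.x^2 / n.x
    vy <- s.y^2 / n.y
    se <- sqrt(vx + vy)
    df <- (vx + vy)^2 / (vx^2 / (n.x - 1) + vy^2 / (n.y - 1))
  }
  tstat <- (mean.x - mean.y - mu) / se
  c(t = tstat, df = df, p.value = 2 * pt(-abs(tstat), df))
}

# Raw data, used only to verify the summary-based results
x <- c(7.8, 6.6, 6.5, 7.4, 7.3, 7.0, 6.4, 7.1, 6.7, 7.6, 6.8)
y <- c(4.5, 5.4, 6.1, 6.1, 5.4, 5.0, 4.1, 5.5)

welch  <- t_from_summary(mean(x), sd(x), length(x), mean(y), sd(y), length(y))
pooled <- t_from_summary(mean(x), sd(x), length(x), mean(y), sd(y), length(y),
                         var.equal = TRUE)

ref_welch  <- t.test(x, y)
ref_pooled <- t.test(x, y, var.equal = TRUE)
```

The two variants share the same numerator and differ only in the standard error and the degrees of freedom, which is why a single var.equal flag is enough to switch between them.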
A list of class htest, containing the following components:
statistic | the t-statistic, with names attribute "t" |
parameters | is the degrees of freedom of the t-distribution associated with statistic. Component parameters has names attribute "df". |
p.value | the p-value for the test |
conf.int | is a confidence interval (vector of length 2) for the true mean or difference in means. The confidence level is recorded in the attribute conf.level. |
estimate | is a vector of length 1 or 2, giving the sample mean(s) or mean of differences; these estimate the corresponding population parameters. Component estimate has a names attribute describing its elements. |
null.value | is the value of the mean or difference in means specified by the null hypothesis. This equals the input argument mu. |
alternative | records the value of the input argument alternative: "greater", "less", or "two.sided". |
data.name | is a character string (vector of length 1) containing the names x and y for the two summarized samples. |
For the one-sample t-test, the null hypothesis is that the mean of the population from which x is drawn is mu. For the standard and Welch modified two-sample t-tests, the null hypothesis is that the population mean for x minus that for y is mu.
The alternative hypothesis in each case indicates the direction of divergence of the population mean for x (or difference of means for x and y) from mu (i.e., "greater", "less", or "two.sided").
The assumption of equal population variances is central to the standard two-sample t-test. This test can be misleading when population variances are not equal, as the null distribution of the test statistic is no longer a t-distribution. If the assumption of equal variances is doubtful with respect to a particular dataset, the Welch modification of the t-test should be used.
The t-test and the associated confidence interval are quite robust with respect to level toward heavy-tailed non-Gaussian distributions (e.g., data with outliers). However, the t-test is non-robust with respect to power, and the confidence interval is non-robust with respect to average length, toward these same types of distributions.
For each of the above tests, an expression for the related confidence interval (returned component conf.int) can be obtained in the usual way by inverting the expression for the test statistic. Note that, as explained under the description of conf.int, the confidence interval will be half-infinite when alternative is not "two.sided"; infinity will be represented by Inf.
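To make the half-infinite intervals concrete, here is a minimal base R sketch of one-sided t intervals built from summary statistics. The numbers (mean 4, standard deviation 2.89, sample size 25) are taken from the one-sample example for tsum.test; the variable names are illustrative, and the orientation of the bounds follows the convention used by stats::t.test.

```r
# One-sided t confidence intervals from summary statistics (illustrative sketch)
mean.x <- 4
s.x <- 2.89
n.x <- 25
conf.level <- 0.95

se <- s.x / sqrt(n.x)                    # standard error of the mean
tcrit <- qt(conf.level, df = n.x - 1)    # all of the error goes in one tail

ci_greater <- c(mean.x - tcrit * se, Inf)   # alternative = "greater"
ci_less    <- c(-Inf, mean.x + tcrit * se)  # alternative = "less"

ci_greater
ci_less
```

A one-sided interval uses the conf.level quantile rather than the 1 - (1 - conf.level)/2 quantile, so its finite bound lies closer to the sample mean than the corresponding bound of the two-sided interval.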
Alan T. Arnholt <[email protected]>
Kitchens, L.J. 2003. Basic Statistics and Data Analysis. Duxbury.
Hogg, R. V. and Craig, A. T. 1970. Introduction to Mathematical Statistics, 3rd ed. Toronto, Canada: Macmillan.
Mood, A. M., Graybill, F. A. and Boes, D. C. 1974. Introduction to the Theory of Statistics, 3rd ed. New York: McGraw-Hill.
Snedecor, G. W. and Cochran, W. G. 1980. Statistical Methods, 7th ed. Ames, Iowa: Iowa State University Press.
# 95% Confidence Interval for mu1 - mu2, assuming equal variances
round(tsum.test(mean.x = 53/15, mean.y = 77/11,
                s.x = sqrt((222 - 15*(53/15)^2)/14),
                s.y = sqrt((560 - 11*(77/11)^2)/10),
                n.x = 15, n.y = 11, var.equal = TRUE)$conf, 2)
# One Sample t-test
tsum.test(mean.x = 4, s.x = 2.89, n.x = 25, mu = 2.5)
Function creates side-by-side boxplots for each factor, a design plot (means), and an interaction plot.
twoway.plots(Y, fac1, fac2, COL = c("#A9E2FF", "#0080FF"))
Y | response variable |
fac1 | factor one |
fac2 | factor two |
COL | a vector with two colors |
Alan T. Arnholt <[email protected]>
with(data = TireWear, twoway.plots(Wear, Treat, Block))
The manager of a URL commercial address is interested in predicting the number of megabytes downloaded, megasd, by clients according to the number of minutes they are connected, mconnected. The manager randomly selects (megabyte, minute) pairs, and records the data. The pairs (megasd, mconnected) are stored in the data frame URLaddress.
A data frame with 30 observations on the following 2 variables:
megabytes downloaded
number of minutes connected
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
xyplot(mconnected ~ megasd, data = URLaddress, type=c("p", "r"))
Descriptive information and the appraised total price (in Euros) for apartments in Vitoria, Spain.
A data frame with 218 observations on the following 16 variables:
the number of the observation
the market total price (in Euros) of the apartment including garage(s) and storage room(s)
the total living area of the apartment in square meters
a factor indicating the neighborhood where the apartment is located with levels Z11, Z21, Z31, Z32, Z34, Z35, Z36, Z37, Z38, Z41, Z42, Z43, Z44, Z45, Z46, Z47, Z48, Z49, Z52, Z53, Z56, Z61, and Z62.
a factor indicating the condition of the apartment with levels 2A, 2B, 3A, 3B, 4A, 4B, and 5A. The levels are ordered so that 2A is the best and 5A is the worst.
age of the apartment
floor on which the apartment is located
total number of rooms including bedrooms, dining room, and kitchen
a factor indicating the percent of the apartment exposed to the elements. The levels E100, E75, E50, and E25 correspond to complete exposure, 75% exposure, 50% exposure, and 25% exposure respectively.
an ordered factor indicating the state of conservation of the apartment. The levels 1A, 2A, 2B, and 3A are ordered from best to worst conservation.
the number of bathrooms
the number of garages
indicates the absence (0) or presence (1) of elevators.
an ordered factor from best to worst indicating the category of the street with levels S2, S3, S4, and S5
a factor indicating the type of heating with levels 1A, 3A, 3B, and 4A which correspond to: no heating, low-standard private heating, high-standard private heating, and central heating respectively.
the number of storage rooms outside of the apartment
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
modTotal <- lm(totalprice ~ area + as.factor(elevator) + area:as.factor(elevator), data = vit2005)
modSimpl <- lm(totalprice ~ area, data = vit2005)
anova(modSimpl, modTotal)
rm(modSimpl, modTotal)
A statistician records how long he must wait for his bus each morning. Data are used in Example 10.4.
A data frame with 15 observations on the following variable:
waiting time in minutes
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Wait, wilcox.test(wt, mu = 6, alternative = "less"))
Diameter of washers.
A data frame with 20 observations on the following variable:
diameter of washer in cm
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Washer, EDA(diameters))
An independent agency measures the sodium content in 20 samples from source X and in 10 samples from source Y and stores them in data frame Water. Data are used in Example 9.12.
A data frame with 30 observations on the following 4 variables:
sodium content measured in mg/L
sodium content measured in mg/L
combined sodium content measured in mg/L
a factor with levels X and Y
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Water, t.test(X, Y, alternative = "less"))
The following data are the test scores from a group of 50 patients from the Virgen del Camino Hospital (Pamplona, Spain) on the Wisconsin Card Sorting Test.
A data frame with 50 observations on the following variable:
score on the Wisconsin Card Sorting Test
The “Wisconsin Card Sorting Test” is widely used by psychiatrists, neurologists, and neuropsychologists with patients who have a brain injury, neurodegenerative disease, or a mental illness such as schizophrenia. Patients with any sort of frontal lobe lesion generally do poorly on the test.
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
densityplot(~score, data = WCST, ref = TRUE)
The data come from an experiment to study the gain in weight of rats fed on four different diets, distinguished by amount of protein (low and high) and by source of protein (beef and cereal).
A data frame with 40 observations on the following 4 variables.
a factor with levels Beef and Cereal
a factor with levels High and Low
weight gain in grams
The design of the experiment is completely randomized, with ten rats on each of the four treatments.
D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway and E. Ostrowski (1994) A Handbook of Small Datasets. Chapman and Hall/CRC, London.
aov(weightgain ~ ProteinSource*ProteinAmount, data = WeightGain)
Seventeen Spanish communities and their corresponding surface area (in hectares) dedicated to growing wheat
A data frame with 17 observations on the following 3 variables:
a factor with levels Andalucia, Aragon, Asturias, Baleares, C.Valenciana, Canarias, Cantabria, Castilla-La Mancha, Castilla-Leon, Cataluna, Extremadura, Galicia, La Rioja, Madrid, Murcia, Navarra, and P.Vasco
surface area measured in hectares
surface area measured in acres
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = WheatSpain, boxplot(hectares))
USA's 2004 harvested wheat surface by state
A data frame with 30 observations on the following 2 variables.
a factor with levels AR, CA, CO, DE, GA, ID, IL, IN, KS, KY, MD, MI, MO, MS, MT, NC, NE, NY, OH, OK, OR, Other, PA, SC, SD, TN, TX, VA, WA, and WI
wheat surface area measured in 1000s of acres
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = wheatUSA2004, hist(ACRES))
Performs exact one sample and two sample Wilcoxon tests on vectors of data
wilcoxE.test(
  x, y = NULL, mu = 0, paired = FALSE,
  alternative = c("two.sided", "less", "greater"),
  conf.level = 0.95
)
x | is a numeric vector of data values. Non-finite (i.e. infinite or missing) values will be omitted. |
y | an optional numeric vector of data values |
mu | a number specifying an optional parameter used to form the null hypothesis |
paired | a logical indicating whether you want a paired test |
alternative | a character string specifying the alternative hypothesis, must be one of "two.sided", "less", or "greater" |
conf.level | confidence level of the interval |
If only x is given, or if both x and y are given and paired = TRUE, a Wilcoxon signed rank test of the null hypothesis that the distribution of x (in the one sample case) or of x - y (in the paired two sample case) is symmetric about mu is performed.
Otherwise, if both x and y are given and paired = FALSE, a Wilcoxon rank sum test is done. In this case, the null hypothesis is that the distributions of x and y differ by a location shift of mu, and the alternative is that they differ by some other location shift (and the one-sided alternative "greater" is that x is shifted to the right of y).
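The idea behind an exact signed rank test can be sketched in base R by enumerating every possible assignment of signs to the ranks. The helper signed_rank_exact below is hypothetical and makes no claim about how wilcoxE.test is implemented; the data vector is made up for illustration, and the sketch assumes no zero differences.

```r
# Hypothetical sketch: exact one-sided p-value for the Wilcoxon signed rank
# test by enumerating all 2^n sign assignments (assumes no zero differences).
signed_rank_exact <- function(x, mu = 0) {
  d <- x - mu
  r <- rank(abs(d))                 # ranks of the absolute differences
  t_obs <- sum(r[d > 0])            # observed sum of the positive ranks
  n <- length(d)
  # Each row is one way of marking ranks as positive (1) or negative (0)
  signs <- as.matrix(expand.grid(rep(list(c(0, 1)), n)))
  t_all <- as.vector(signs %*% r)   # statistic under every assignment
  c(statistic = t_obs, p.greater = mean(t_all >= t_obs))
}

# Made-up data; the null hypothesis is symmetry about mu = 5
x <- c(6.1, 4.2, 7.9, 5.5, 3.0, 8.4, 6.8)
res <- signed_rank_exact(x, mu = 5)
res

# Reference value from base R (exact because n is small and there are no ties)
ref <- wilcox.test(x, mu = 5, alternative = "greater", exact = TRUE)
ref$p.value
```

Under the null hypothesis every sign pattern is equally likely, so the p-value is simply the fraction of patterns whose statistic is at least as large as the observed one.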
A list of class htest, containing the following components:
statistic | the value of the test statistic with a name describing it |
p.value | the p-value for the test |
null.value | the location parameter |
alternative | a character string describing the alternative hypothesis |
method | the type of test applied |
data.name | a character string giving the names of the data |
conf.int | a confidence interval for the location parameter |
estimate | an estimate of the location parameter |
The function is rather primitive and should only be used for problems with fewer than 19 observations as the memory requirements are rather large.
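A back-of-the-envelope calculation shows why exhaustive enumeration becomes expensive quickly. The sketch below assumes, purely for illustration, an approach that stores every sign pattern as a row of a numeric matrix (8 bytes per entry); the actual storage scheme used by wilcoxE.test is not documented here.

```r
# Growth of the enumeration size with the number of observations n
# (illustrative assumption: one n-column numeric row per sign pattern)
n <- c(5, 10, 15, 18, 20, 25)
counts <- data.frame(
  n = n,
  signed_rank_assignments = 2^n,          # sign patterns for n observations
  rank_sum_splits = choose(2 * n, n),     # group labelings, two samples of size n
  matrix_megabytes = n * 2^n * 8 / 2^20   # size of an n x 2^n numeric matrix
)
counts
```

Each additional observation doubles the number of sign patterns, so the cost more than doubles with every extra data point.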
Alan T. Arnholt <[email protected]>
Gibbons, J.D. and Chakraborti, S. 1992. Nonparametric Statistical Inference. Marcel Dekker Inc., New York.
Hollander, M. and Wolfe, D.A. 1999. Nonparametric Statistical Methods. New York: John Wiley & Sons.
# Wilcoxon Signed Rank Test
PH <- c(7.2, 7.3, 7.3, 7.4)
wilcoxE.test(PH, mu = 7.25, alternative = "greater")
# Wilcoxon Signed Rank Test (Dependent Samples)
with(data = Aggression, wilcoxE.test(violence, noviolence, paired = TRUE, alternative = "greater"))
# Wilcoxon Rank Sum Test
x <- c(7.2, 7.2, 7.3, 7.3)
y <- c(7.3, 7.3, 7.4, 7.4)
wilcoxE.test(x, y)
rm(PH, x, y)
Random sample of wool production in kilograms on 5 different days at two different locations
A data frame with 15 observations on the following 2 variables:
wool production in thousands of kilograms
wool production in thousands of kilograms
Ugarte, M. D., Militino, A. F., and Arnholt, A. T. (2008) Probability and Statistics with R. Chapman & Hall/CRC.
with(data = Wool, t.test(textileA, textileB))
This function is based on the standard normal distribution and creates confidence intervals and tests hypotheses for both one and two sample problems.
z.test(
  x, sigma.x = NULL, y = NULL, sigma.y = NULL, sigma.d = NULL,
  alternative = c("two.sided", "less", "greater"),
  mu = 0, paired = FALSE, conf.level = 0.95, ...
)
x | a (non-empty) numeric vector of data values |
sigma.x | a single number representing the population standard deviation for x |
y | an optional (non-empty) numeric vector of data values |
sigma.y | a single number representing the population standard deviation for y |
sigma.d | a single number representing the population standard deviation for the paired differences |
alternative | character string, one of "two.sided", "less", or "greater" |
mu | a single number representing the value of the mean or difference in means specified by the null hypothesis |
paired | a logical indicating whether you want a paired z-test |
conf.level | confidence level for the returned confidence interval, restricted to lie between zero and one |
... | Other arguments passed onto |
If y is NULL, a one-sample z-test is carried out with x provided sigma.x is not NULL. If y is not NULL, a standard two-sample z-test is performed provided both sigma.x and sigma.y are finite. If paired = TRUE, a paired z-test where the differences are defined as x - y is performed when the user enters a finite value for sigma.d (the population standard deviation for the differences).
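The one-sample and paired cases reduce to the same arithmetic, which the base R sketch below tries to show. The helper z_one_sample is hypothetical (it is not the PASWR code), and the before/after measurements and the value 0.4 for the standard deviation of the differences are made up for illustration.

```r
# Hypothetical helper (not part of PASWR): two-sided one-sample z-test with a
# known population standard deviation.
z_one_sample <- function(x, sigma, mu = 0, conf.level = 0.95) {
  se <- sigma / sqrt(length(x))              # standard error of the mean
  z <- (mean(x) - mu) / se
  zcrit <- qnorm(1 - (1 - conf.level) / 2)
  list(statistic = z,
       p.value = 2 * pnorm(-abs(z)),
       conf.int = mean(x) + c(-1, 1) * zcrit * se)
}

# A paired z-test is a one-sample z-test on the differences x - y,
# using the standard deviation of the differences (sigma.d).
before <- c(12.1, 11.4, 13.0, 12.6, 11.9, 12.8)   # made-up values
after  <- c(11.6, 11.5, 12.2, 12.0, 11.7, 12.1)   # made-up values
paired <- z_one_sample(before - after, sigma = 0.4)
paired
```

Because the pairing is absorbed into the differences, no separate sigma.x or sigma.y is needed in the paired case; only the standard deviation of the differences enters the standard error.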
A list of class htest, containing the following components:
statistic | the z-statistic, with names attribute "z" |
p.value | the p-value for the test |
conf.int | is a confidence interval (vector of length 2) for the true mean or difference in means. The confidence level is recorded in the attribute conf.level. |
estimate | vector of length 1 or 2, giving the sample mean(s) or mean of differences; these estimate the corresponding population parameters. Component estimate has a names attribute describing its elements. |
null.value | the value of the mean or difference of means specified by the null hypothesis. This equals the input argument mu. |
alternative | records the value of the input argument alternative: "greater", "less", or "two.sided". |
data.name | a character string (vector of length 1) containing the actual names of the input vectors |
For the one-sample z-test, the null hypothesis is that the mean of the population from which x is drawn is mu. For the standard two-sample z-test, the null hypothesis is that the population mean for x minus that for y is mu. For the paired z-test, the null hypothesis is that the mean difference between x and y is mu.
The alternative hypothesis in each case indicates the direction of divergence of the population mean for x (or difference of means for x and y) from mu (i.e., "greater", "less", or "two.sided").
The assumption of normality for the underlying distribution or a sufficiently large sample size is required along with the population standard deviation to use Z procedures.
For each of the above tests, an expression for the related confidence interval (returned component conf.int) can be obtained in the usual way by inverting the expression for the test statistic. Note that, as explained under the description of conf.int, the confidence interval will be half-infinite when alternative is not "two.sided"; infinity will be represented by Inf.
Alan T. Arnholt <[email protected]>
Kitchens, L.J. 2003. Basic Statistics and Data Analysis. Duxbury.
Hogg, R. V. and Craig, A. T. 1970. Introduction to Mathematical Statistics, 3rd ed. Toronto, Canada: Macmillan.
Mood, A. M., Graybill, F. A. and Boes, D. C. 1974. Introduction to the Theory of Statistics, 3rd ed. New York: McGraw-Hill.
Snedecor, G. W. and Cochran, W. G. 1980. Statistical Methods, 7th ed. Ames, Iowa: Iowa State University Press.
with(data = Grocery, z.test(x = groceries, sigma.x = 30, conf.level = 0.97)$conf)
# Example 8.3 from PASWR.
x <- rnorm(12)
z.test(x, sigma.x = 1)
# Two-sided one-sample z-test where the assumed value for
# sigma.x is one. The null hypothesis is that the population
# mean for 'x' is zero. The alternative hypothesis states
# that it is either greater or less than zero. A confidence
# interval for the population mean will be computed.
x <- c(7.8, 6.6, 6.5, 7.4, 7.3, 7., 6.4, 7.1, 6.7, 7.6, 6.8)
y <- c(4.5, 5.4, 6.1, 6.1, 5.4, 5., 4.1, 5.5)
z.test(x, sigma.x=0.5, y, sigma.y=0.5, mu=2)
# Two-sided standard two-sample z-test where sigma.x
# and sigma.y are both assumed to equal 0.5. The null hypothesis
# is that the population mean for 'x' minus that for 'y' is 2.
# The alternative hypothesis is that this difference is not 2.
# A confidence interval for the true difference will be computed.
z.test(x, sigma.x = 0.5, y, sigma.y = 0.5, conf.level = 0.90)
# Two-sided standard two-sample z-test where sigma.x and
# sigma.y are both assumed to equal 0.5. The null hypothesis
# is that the population mean for 'x' minus that for 'y' is zero.
# The alternative hypothesis is that this difference is not
# zero. A 90% confidence interval for the true difference will
# be computed.
rm(x, y)
This function is based on the standard normal distribution and creates confidence intervals and tests hypotheses for both one and two sample problems based on summarized information the user passes to the function. Output is identical to that produced with z.test.
zsum.test(
  mean.x, sigma.x = NULL, n.x = NULL,
  mean.y = NULL, sigma.y = NULL, n.y = NULL,
  alternative = c("two.sided", "less", "greater"),
  mu = 0, conf.level = 0.95, ...
)
mean.x | a single number representing the sample mean of x |
sigma.x | a single number representing the population standard deviation for x |
n.x | a single number representing the sample size for x |
mean.y | a single number representing the sample mean of y |
sigma.y | a single number representing the population standard deviation for y |
n.y | a single number representing the sample size for y |
alternative | is a character string, one of "two.sided", "less", or "greater" |
mu | a single number representing the value of the mean or difference in means specified by the null hypothesis |
conf.level | confidence level for the returned confidence interval, restricted to lie between zero and one |
... | Other arguments passed onto |
If mean.y is NULL, a one-sample z-test is carried out with the summary statistics for x provided sigma.x is finite. If mean.y is not NULL, a standard two-sample z-test is performed provided both sigma.x and sigma.y are finite.
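The calculation behind a summary-based z-test is short enough to sketch in base R. The helper z_from_summary below is hypothetical and is not the PASWR implementation; the one-sample numbers come from the Example 9.7 call in the examples for zsum.test, and the x and y vectors are the ones used there as well.

```r
# Hypothetical helper (not part of PASWR): one- or two-sample z-test from
# summary statistics with known population standard deviations.
z_from_summary <- function(mean.x, sigma.x, n.x,
                           mean.y = NULL, sigma.y = NULL, n.y = NULL,
                           alternative = c("two.sided", "less", "greater"),
                           mu = 0) {
  alternative <- match.arg(alternative)
  if (is.null(mean.y)) {
    est <- mean.x                                   # one-sample case
    se <- sigma.x / sqrt(n.x)
  } else {
    est <- mean.x - mean.y                          # two-sample case
    se <- sqrt(sigma.x^2 / n.x + sigma.y^2 / n.y)
  }
  z <- (est - mu) / se
  p <- switch(alternative,
              two.sided = 2 * pnorm(-abs(z)),
              less = pnorm(z),
              greater = pnorm(z, lower.tail = FALSE))
  c(z = z, p.value = p)
}

# One-sample test with the Example 9.7 summary values
one <- z_from_summary(mean.x = 56/30, sigma.x = 2, n.x = 30,
                      alternative = "greater", mu = 1.8)

# Two-sample test, null hypothesis: mean of x minus mean of y equals 2
x <- c(7.8, 6.6, 6.5, 7.4, 7.3, 7.0, 6.4, 7.1, 6.7, 7.6, 6.8)
y <- c(4.5, 5.4, 6.1, 6.1, 5.4, 5.0, 4.1, 5.5)
two <- z_from_summary(mean(x), 0.5, length(x), mean(y), 0.5, length(y), mu = 2)

one
two
```

Only the means, the known standard deviations, and the sample sizes enter the formulas, which is why the raw data are not needed once those summaries are available.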
A list of class htest, containing the following components:
statistic | the z-statistic, with names attribute "z" |
p.value | the p-value for the test |
conf.int | is a confidence interval (vector of length 2) for the true mean or difference in means. The confidence level is recorded in the attribute conf.level. |
estimate | vector of length 1 or 2, giving the sample mean(s) or mean of differences; these estimate the corresponding population parameters. Component estimate has a names attribute describing its elements. |
null.value | the value of the mean or difference in means specified by the null hypothesis. This equals the input argument mu. |
alternative | records the value of the input argument alternative: "greater", "less", or "two.sided". |
data.name | a character string (vector of length 1) containing the names x and y for the two summarized samples. |
For the one-sample z-test, the null hypothesis is that the mean of the population from which x is drawn is mu. For the standard two-sample z-test, the null hypothesis is that the population mean for x minus that for y is mu.
The alternative hypothesis in each case indicates the direction of divergence of the population mean for x (or difference of means for x and y) from mu (i.e., "greater", "less", or "two.sided").
The assumption of normality for the underlying distribution or a sufficiently large sample size is required along with the population standard deviation to use Z procedures.
For each of the above tests, an expression for the related confidence interval (returned component conf.int) can be obtained in the usual way by inverting the expression for the test statistic. Note that, as explained under the description of conf.int, the confidence interval will be half-infinite when alternative is not "two.sided"; infinity will be represented by Inf.
Alan T. Arnholt <[email protected]>
Kitchens, L.J. 2003. Basic Statistics and Data Analysis. Duxbury.
Hogg, R. V. and Craig, A. T. 1970. Introduction to Mathematical Statistics, 3rd ed. Toronto, Canada: Macmillan.
Mood, A. M., Graybill, F. A. and Boes, D. C. 1974. Introduction to the Theory of Statistics, 3rd ed. New York: McGraw-Hill.
Snedecor, G. W. and Cochran, W. G. 1980. Statistical Methods, 7th ed. Ames, Iowa: Iowa State University Press.
zsum.test(mean.x = 56/30, sigma.x = 2, n.x = 30, alternative = "greater", mu = 1.8)
# Example 9.7 part a. from PASWR.
x <- rnorm(12)
zsum.test(mean(x), sigma.x = 1, n.x = 12)
# Two-sided one-sample z-test where the assumed value for
# sigma.x is one. The null hypothesis is that the population
# mean for 'x' is zero. The alternative hypothesis states
# that it is either greater or less than zero. A confidence
# interval for the population mean will be computed.
# Note: returns same answer as:
z.test(x, sigma.x = 1)
x <- c(7.8, 6.6, 6.5, 7.4, 7.3, 7.0, 6.4, 7.1, 6.7, 7.6, 6.8)
y <- c(4.5, 5.4, 6.1, 6.1, 5.4, 5.0, 4.1, 5.5)
zsum.test(mean(x), sigma.x = 0.5, n.x = 11, mean(y), sigma.y = 0.5, n.y = 8, mu = 2)
# Two-sided standard two-sample z-test where sigma.x
# and sigma.y are both assumed to equal 0.5. The null hypothesis
# is that the population mean for 'x' minus that for 'y' is 2.
# The alternative hypothesis is that this difference is not 2.
# A confidence interval for the true difference will be computed.
# Note: returns same answer as:
z.test(x, sigma.x = 0.5, y, sigma.y = 0.5, mu = 2)
zsum.test(mean(x), sigma.x = 0.5, n.x = 11, mean(y), sigma.y = 0.5, n.y = 8, conf.level = 0.90)
# Two-sided standard two-sample z-test where sigma.x and
# sigma.y are both assumed to equal 0.5. The null hypothesis
# is that the population mean for 'x' minus that for 'y' is zero.
# The alternative hypothesis is that this difference is not
# zero. A 90% confidence interval for the true difference will
# be computed.
# Note: returns same answer as:
z.test(x, sigma.x = 0.5, y, sigma.y = 0.5, conf.level = 0.90)
rm(x, y)