Homework -5 Multiple Regression

Subject: Mathematics    / Statistics
Question
Homework #5
Multiple Regression
Please submit on Canvas a separate Excel file for each problem. The hard copy of the homework
that you are turning in should include the answers to the questions asked.
You can refer to charts and outputs in your electronic files to support your answers, however
all conclusions should be clearly stated in the hard copy of the homework.
Problem 1 (20 points)
When potential workers apply for a job that requires extensive manual assembly of small
intricate parts, they are initially given three different tests to measure their manual dexterity.
The ones who are hired are then periodically given a performance rating on a 0 to 100 scale that
combines their speed and accuracy in performing the required assembly operations.
The file P11_49.xlsx lists the test scores and performance ratings for a randomly selected group
of 80 employees. It also lists their seniority (months with the company) at the time of the
performance rating.
Employee
1
2

79
80

Seniority
19
14

17
11

Test1
65
64

57
47

Test2
45
58

68
61

Test3
38
47

61
44

Performance
67
58

56
54

a. Use the Correlation and Covariance option in StatTools or the Correlation option in the
Data Analysis tool in Excel (or the CORREL function) to calculate a table of correlations:
Seniority
Seniority
Test1
Test2
Test3
Performance

Test1

Test2

Test3

Performance

1.000
1.000
1.000
1.000
1.000

b. Based only on these correlations, can you say with certainty that the value of the coefficient
of determination r 2 for the regression of performance rating on seniority and test scores
will be at least 35%? Why or why not?
c. Is there any evidence in the correlation table that collinearity will be a problem?
Why or why not?

-1-

d. Run the regression of Performance Rating versus all four explanatory variables.
List the regression equation and the values of r 2 and s e .
e. Do all of the coefficients have the signs (negative or positive) you would expect?
Briefly explain.
f. If a worker (outside of the 80 in the sample) has 15 months of seniority and test scores of
57, 71, and 63, find a prediction for this worker’s Performance rating score.
g. One of the t-values for the regression coefficients is less than 1. Explain briefly
why this occurred. Does it mean that this variable is not related to Performance Rating?
h. Find the variance inflation factor VIF corresponding to this variable.
What does the value of this VIF tell you?
i. Arguably, the three test measures provide overlapping (or redundant) information.
For the sake of parsimony (explaining “the most with the least”), it might be sensible to regress
Performance rating versus only two explanatory variables, Seniority and Average Test,
where Average Test is the average of the three test scores:
Average Test = (Test1 + Test2 + Test3)/3
Run this regression and report the equation itself, r 2 and s e .
j. Can you argue that this equation is just as good as the equation in part (d)? Explain briefly.

-2-

Problem 2 (20 points)
The following market research study was conducted by a national chain of consumer electronics
stores. To promote sales, the chain relies heavily on local newspaper advertising to support its
modest exposure in nationwide television commercials. A sample of 20 cities with similar
populations and monthly sales totals were assigned different newspaper advertising budgets for
one month. The file Advertising.xlsx summarizes the sales (in $ millions) and the newspaper
advertising budgets (in $1,000) observed during the study.
City
1
2

19
20

Sales
6.14
6.04

6.84
6.99

Advertising
5
5

25
25

a. Construct a scatter plot for advertising and sales. What kind of relationship does it suggest?
b. Fit a quadratic regression model. List the regression equation and the values of r 2 and s e .
c. Interpret the meaning of the coefficient of determination.
d. Predict the monthly sales for a city with newspaper advertising of $20,000.
e. Construct a normal probability plot of residuals and a plot of residuals versus fitted values.
Are the four assumptions for regression analysis satisfied? Explain.
f. At the 0.05 level of significance, is there a significant overall quadratic relationship between
monthly sales and newspaper advertising? Formulate and perform the appropriate hypothesis test.
g. At the 0.05 level of significance, determine whether the quadratic model is a better fit than the
linear model. Formulate and perform the appropriate hypothesis test.
h. Run a linear regression. Report the regression equation, r 2 and s e .
i. Do your results in (h) seem to support or contradict the conclusion you reached in (g)? Explain.

-3-

Problem 3 (50 points)
The price of advertising is different from one consumer magazine to another. What causes these
differences in price? Probably something related to the value of the ad to the advertiser.
Magazines that reach more readers (all else equal) should be able to charge more for an ad.
Also, magazines that reach a better-paid reading audience should probably be able to charge more.
Although there may be other important factors, let’s look at these two together with one more,
gender difference, to see if magazines charge more based on the percentage of men or women
among the readers. Multiple regression will provide some answers and can help explain the
impact of audience size, income, and gender on advertising prices.
The file Magazine_Ads.xlsx contains the multivariate data set to be analyzed. The variables are:
y– page costs for a “four-color, one-page ad run once”

x1 – audience (readers) in thousands

x 2 – percent male among the readership
x3 – median household income
Magazine
1
2

44
45

AAA Westways
AARP The Magazine

Wired
Woman’s Day

Page Costs
($)
53310
532600

99475
259960

Audience
(thousands)
8740
35721

2789
20325

Male
(%)
47.0
39.7

75.5
0.0

Income
($)
92600
58990

91056
58053

a. Formulate the multiple regression model.
b. Run the multiple regression and report the estimated regression equation.
c. Interpret the intercept. Does it make sense in this context?
d. Interpret the slope coefficients.
e. Below is the correlation matrix.

Audience
Male
Income
Page Costs

Audience

Male

Income

Page Costs

1
–0.215
–0.377
0.850

1
0.540
–0.126

1
–0.148

1

The correlation between page costs and median income is –0.148 indicating that higher
median income is actually associated with lower page costs. How can this be?
Doesn’t it contradict the interpretation of the regression coefficient for income in (d)?
-4-

f. The magazine Martha Stewart Living is in the original data set:
Magazine
25 Martha Stewart Living

Page Costs
($)
157,700

Audience
(thousands)
11,200

Male
(%)
11.0

Income
($ thousands)
74,436

What is the residual or prediction error for this magazine?
How can Martha Stewart Living charge so much less than we would expect?
g. Report and interpret the standard error of estimate.
h. Interpret the coefficient of determination.
i. Is there some predictive relationship between the x variables and the y in the population of
similar conceivable magazines? Formulate and perform the appropriate test.
j. Could the regression coefficient for percent male reasonably be just randomness?
In other words, could it be that the effect of percent male readership on page costs is just a
random number, rather than indicating a systematic relationship?
Formulate and perform the appropriate test.
k. Use the correlation matrix and scatterplots (one for each pair of variables) to describe the
structure of the relationships among x1 , x2 , x3 , y.
l. What does the plot of residuals versus predicted page costs show?
Are the regression assumptions challenged?
m. Create histograms of the three explanatory variables. Does any of them show skewness?
n. Use Excel’s LN function to transform the variable with the highest skewness.
What does the histogram of the transformed variable show?
o. Run a regression with this transformed variable and the other variables unchanged.
Report r 2 and s e . Do you think the log transformation of the variable with highest skewness
improved the regression results?
p. Construct and discuss a plot of residuals versus predicted page costs for the model with the
transformed variable. Can you suggest a remedy for the problems (if any) you see in this plot?
q. Transform all of the original variables that measure amounts (that is, page costs, audience,
and median income) using natural logarithms. What is the equation of this model?
r. Run a regression for this model. Report r 2 and s e . Can we use r 2 and s e to compare
directly the results from this model with the original model? Why or why not?
s. Construct a residual plot and discuss if the log transformations have helped.
t. What is the estimated regression equation?
Interpret the slope coefficient for log audience.
u. Does audience have a significant impact on page costs, holding percent male and median
income fixed?
v. What is the predicted value of page costs for Martha Stewart Living?
Compare this value with the value predicted in part (f).
Which of the two models provides a better prediction for this magazine?
-5-