This is a course in data analysis. Topics covered include simple and multiple linear regression, causation, diagnostics, logistic regression, and generalized linear models; model selection (prediction risk, bias-variance tradeoff, risk estimation, model search, ridge regression and lasso, stepwise regression); and smoothing and nonparametric regression (linear smoothers, kernels, local regression, penalized regression, splines, variance estimation, confidence bands, local likelihood, additive models). Students will practice real-world data analysis through several course projects.

This course is primarily for first-year PhD students in Statistics & Data Science. It requires an appropriate background for entering that program, including linear algebra, multivariate calculus, and basic statistical theory. For example, an appropriate background would be to have received an A in a course on statistical inference of the caliber of 36-700 or 36-705 and either have extensive experience in statistical data analysis or received As in applied statistics classes focused on analyzing data, such as 36-401/607 **and** 36-402/608. Students should also be familiar with R on a level similar to 36-350 Statistical Computing, and be able to write a data analysis report (or have experience writing similar reports, such as lab reports).

Students not in the Statistics & Data Science PhD program can add themselves to the waitlist, and should contact the instructor to be added to the class. Describe your prior statistics experience and course work so we can ensure you are prepared for the course. Students interested in an applied regression course requiring less mathematical background should consider 36-401/607 (but contact the instructor to make sure it’s suitable for you!).

Most assignments in this course will use R and R Markdown.

By the end of this course, students will be able to:

- Derive and compare the statistical properties of regression methods.
- Interpret the statistical meaning of regression results.
- Choose regression procedures that are appropriate for a given dataset and substantive question to answer.
- Select appropriate graphical and statistical diagnostics to verify the assumptions of regression methods.
- Interpret the results of these diagnostics to select appropriate procedures, and explain how they affect the interpretation of the results.
- Explain the scientific meaning of regression results in non-statistical terms.
- Use R and R Markdown to analyze data and construct data analysis reports.
- Write data analysis reports that integrate text, graphics, and statistical reports to tell a coherent story and answer substantive questions.

Notice the emphasis on writing, practical data analysis, and modeling data to answer substantive questions. While this course covers the theory of regression and motivates methods statistically, our goal is for you to become well-informed data analysts with a thorough command of the methods you apply to real data. Many parts of this course are designed to prepare Statistics & Data Science PhD students for their first-year Data Analysis Exam.

I may assign reading and exercises from these books:

- Sanford Weisberg, *Applied Linear Regression*, 4th edition (2013). The 4th edition is available using your CMU Library access.
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, *The Elements of Statistical Learning*, 2nd edition (2009). Available electronically through SpringerLink or from the authors’ website (but note the $40 Springer MyCopy is black and white, though many figures are in color!).
- George Seber and Alan Lee, *Linear Regression Analysis*, 2nd edition. Available online through Wiley.
- Ronald Christensen, *Plane Answers to Complex Questions*, 5th edition (2020). A good geometrically based reference to linear model theory. Available electronically through SpringerLink and in $40 MyCopy paperback.
- David Harville, *Matrix Algebra from a Statistician’s Perspective*. Available through SpringerLink.
- Hadley Wickham and Garrett Grolemund, *R for Data Science*. Freely available online or can be purchased in print.

To get access to online copies while off campus, use the CMU VPN.

You do **not** need to buy paper copies. I recommend using the electronic versions as needed, both for assigned readings and as a general reference, and if you discover one book is particularly helpful to you, consider buying a copy.

Other readings may be posted on Canvas as needed.

This course features three major data analysis reports, to be completed individually. In these reports you will practice the skills taught in this class: you will be given a real dataset and several substantive questions about it, and you will examine the data, decide on the best methods to answer the substantive questions, conduct your analysis, and write a report describing your analysis and your results. This is meant to give you realistic practice solving real statistical problems, and particularly to give you practice *writing* about data analysis for your future research and publications.

These reports are designed to resemble the Data Analysis Exam given to Statistics & Data Science PhD students at the end of their first year.

A rubric for the reports will be posted on Canvas, listing the criteria that define a satisfactory report. Rather than assigning a numerical grade when we review your reports, we’ll give you feedback on how well you met each rubric criterion, and give you the opportunity to revise your report to improve any area that may need more work. See the Grading section below for more details.

Besides the reports, you will also complete regular weekly homework. The homework will include theoretical derivations and proofs, plus practical problems conducting simulations, analyzing data, and exploring particular methods. A rubric posted on Canvas will describe the formatting requirements for homework submissions.

Another key weekly activity will be reading. I will assign readings and a few related questions to be answered through Canvas, so that you will be familiar with new topics before we begin discussing them in class. Most readings will come from the textbooks listed above, and the rest will be papers or other materials provided on Canvas.

For homework and data analysis reports, you will have three “grace days” you can use throughout the semester. Each time you use a grace day for an assignment, you get 24 hours extra to submit the assignment. You do not need any excuse to use grace days. Once you have used all three grace days, late work will not be accepted.

This system is meant to allow you flexibility, so that ordinary problems (minor illness, forgot a deadline, had to finish another class’s big assignment, traveled to an event) don’t harm you, and so you do not need my permission to handle unexpected problems. If you experience a serious emergency that prevents you from completing work for a longer time, contact me so we can make arrangements.

Late reading assignments will **not** be accepted, since reading assignments are intended to prepare you for a specific day of class.

Class attendance and participation are essential. If there’s any one message to be learned from pedagogical research, it’s that listening passively to a lecture is **not** a good way to learn how to think about complicated problems. As a result, we will use much of our class time for demonstrations and activities, such as

- practicing new data analysis techniques with real data in R
- running simulations to validate tests or diagnostics
- conducting peer review of data analysis reports
- examining data case studies to determine the appropriate statistical methods to solve real problems

**You are expected to attend class and participate in these activities.** Many of the activities will be expanded upon in homework assignments and submitted for homework credit.

You should bring your laptop to class, as some activities will involve using R for data analysis. But please do **not** distract your classmates by using your laptop in class to do things unrelated to the class, however tempting it may be.

If you cannot attend a class for any reason, please let me know as far in advance as is possible. Class sessions are not recorded, and remote attendance of in-person classes is not possible.

When attending class, you are expected to follow all COVID-19 precautions required by the University.

- Homework assignments
  - Each homework problem will be graded on a simple 2-point scale. 2 = satisfactory, perhaps with small flaws that do not affect the conclusion; 1 = satisfactory, but with flaws that do affect the conclusion; 0 = missing or does not address the problem.
- Analysis reports
  - Following the rubric, a report will be graded Satisfactory if, after revision, the report meets all but a maximum of three of the rubric criteria. Reports that meet all criteria at a high standard of quality will be graded Excellent.

Grades will be assigned according to a table. The homework grades will be averaged. You will earn the highest letter grade for which you meet **both** criteria:

| Homework average | Reports | Grade |
|---|---|---|
| > 85% | 3 excellent | A+ |
| > 80% | 2 excellent, 1 satisfactory | A |
| > 75% | 1 excellent, 2 satisfactory | B |
| > 60% | 3 satisfactory | C |
| < 60% | 3 satisfactory | D |
| < 60% | < 3 satisfactory | R |
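
To illustrate how the "highest grade for which you meet both criteria" rule reads, here is a minimal Python sketch of one possible interpretation of the table. It is not an official grade calculator. The function name `letter_grade` and its arguments are hypothetical, and the sketch makes three assumptions that the table does not state: an Excellent report also counts toward the Satisfactory requirement, fewer than three satisfactory-or-better reports gives an R regardless of homework average, and a homework average of exactly 60% falls in the D row.

```python
# Rows of the grading table, best grade first:
# (grade, homework average must exceed this percent, minimum Excellent reports)
GRADE_ROWS = [
    ("A+", 85.0, 3),
    ("A", 80.0, 2),
    ("B", 75.0, 1),
    ("C", 60.0, 0),
]


def letter_grade(homework_avg, n_excellent, n_satisfactory):
    """Return a letter grade under the assumptions stated above.

    homework_avg   -- homework average as a percentage (0 to 100)
    n_excellent    -- number of reports graded Excellent
    n_satisfactory -- number of reports graded Satisfactory but not Excellent
    """
    # Assumption: without three satisfactory-or-better reports, the grade is R.
    if n_excellent + n_satisfactory < 3:
        return "R"
    # Take the first (highest) row whose two criteria are both met.
    for grade, hw_threshold, min_excellent in GRADE_ROWS:
        if homework_avg > hw_threshold and n_excellent >= min_excellent:
            return grade
    # Three satisfactory reports, but homework average not above 60%.
    return "D"


print(letter_grade(90, 2, 1))  # limited by the reports column
print(letter_grade(78, 3, 0))  # limited by the homework column
```

In both example calls the weaker of the two columns decides the grade, which is the point of requiring both criteria. Any discretionary adjustments by the instructor are outside this sketch.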

If you have concerns about how any of your work was graded, please discuss your concerns with me within **two weeks** of the graded work being returned to you.

The homework average column may be adjusted at the instructor’s discretion, but only in your favor.

Grades can be adjusted up or down one step based on class participation, at the instructor’s discretion. Typically this will only be used in extraordinary cases, such as to penalize a refusal to participate in class activities.

Note that in Dietrich College, the minimum passing grade in Ph.D. courses is B-. This means that if you elect to take the course pass/fail, the Registrar will automatically convert grades below B- to N (no credit) when grades are posted.

Discussing homework and projects with your classmates is allowed and encouraged, and helping explain ideas to each other is a core part of the academic experience. But it is important that every student get practice working on their own. This means that **all the work you turn in must be your own.** You must devise and write your own code, generate your own graphics, and write your own solutions and reports.

You may use external sources (books, websites, papers) to

- Look up R documentation, find useful packages, find explanations for error messages, or remind yourself about the functions to fit some model,
- Find reference materials on statistical methods,
- Clarify material from the course notes or examples.

But external sources must be used to **support** your work, not to **obtain** your work. You may **not** use them to copy code, text, or graphics without attribution. You may **not** use any prior course’s or textbook’s homework solutions in any way. This prohibition applies even to students who are re-taking the course. Do not copy old solutions (in whole or in part), and do not “consult” or read them. Doing any of that is cheating, making any feedback you get meaningless and any evaluation based on that assignment unfair.

If you do use *any* material from other sources, you **must** clearly mark its source. Text taken from other sources must be in quotation marks with citations; figures from other sources need a caption indicating the source; and code from other sources must have a comment indicating the source. We must be able to determine who wrote any material you submit, and you must not falsely imply that you completed work actually done by others.

Please talk to me if you have any questions about this policy. Any form of cheating or plagiarism is grounds for sanctions to be determined by the instructor, including grade penalties or course failure. Students taking the course pass/fail may have this status revoked. I am also obliged in these situations to report the incident to your academic program and the appropriate University authorities. Please refer to the University Policy on Academic Integrity.

If you have a disability and have an accommodations letter from the Disability Resources office, I encourage you to discuss your accommodations and needs with me as early in the semester as possible. I will work with you to ensure that accommodations are provided as appropriate. If you suspect that you may have a disability and would benefit from accommodations but are not yet registered with the Office of Disability Resources, I encourage you to contact them at access@andrew.cmu.edu.

We must treat every individual with respect. We are diverse in many ways, and this diversity is fundamental to building and maintaining an equitable and inclusive campus community. Diversity can refer to multiple ways that we identify ourselves, including but not limited to race, color, national origin, language, sex, disability, age, sexual orientation, gender identity, religion, creed, ancestry, belief, veteran status, or genetic information. Each of these diverse identities, along with many others not mentioned here, shapes the perspectives our students, faculty, and staff bring to our campus. We, at CMU, will work to promote diversity, equity and inclusion not only because diversity fuels excellence and innovation, but because we want to pursue justice. We acknowledge our imperfections while we also fully commit to the work, inside and outside of our classrooms, of building and sustaining a campus community that increasingly embraces these core values.

Each of us is responsible for creating a safer, more inclusive environment.

Unfortunately, incidents of bias or discrimination do occur, whether intentional or unintentional. They contribute to creating an unwelcoming environment for individuals and groups at the university. Therefore, the university encourages anyone who experiences or observes unfair or hostile treatment on the basis of identity to speak out for justice and support, within the moment of the incident or after the incident has passed. Anyone can share these experiences using the following resources:

- Center for Student Diversity and Inclusion: csdi@andrew.cmu.edu, (412) 268-2150
- Report-It online anonymous reporting platform. Username: `tartans`, password: `plaid`

All reports will be documented and deliberated to determine if there should be any following actions. Regardless of incident type, the university will use all shared experiences to transform our campus climate to be more equitable and just.

All of us benefit from support during times of struggle. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is almost always helpful.

If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 or visit their website. Consider reaching out to a friend, faculty or family member you trust for help getting connected to the support that can help.