The main goal of this project is to help students build skills in statistical analysis by
applying descriptive statistics tools to estimate the mean COVID-19 Total Cases per
100,000 people (C19TCP100T) and the mean COVID-19 Proportion of Total Deaths in
Total Cases (C19PTDITC) for each of two selected US states, and then using
those estimates and inferential statistics to test the difference in COVID-19 incidences
across the two selected states. Students are expected to write a final research report
that describes the population of interest, the data collection procedure,
the implementation of the statistical procedure used to estimate the population parameters
(mean C19TCP100T and mean C19PTDITC) from the sample data, the interpretation of the
results, and the policy recommendations.
Learning objectives
Upon completing this research project, the student will be able to:
– Collect and use data in the decision-making process;
– Calculate descriptive statistics;
– Use the Central Limit Theorem to identify the probability distributions of statistics;
– Conduct statistical inference to determine behaviors of population parameters using
sample data;
– Interpret the results of analysis; and
– Make policy recommendations.
Problem Statement
The coronavirus disease 2019 (COVID-19), which first appeared in China in late 2019,
has spread quickly across the world, causing significant health, economic,
demographic, and social disruptions along the way. What was initially seen as a largely
China-centric shock has ballooned into a full-blown global crisis. On March 11, 2020, the
World Health Organization (WHO) declared COVID-19 a global pandemic. COVID-19 has brought
forth new challenges such as social distancing, requirements to wear masks in public
places, teleworking, prohibition of large-scale social events, travel restrictions, and others.
Overcoming those challenges has proved to be the best way to contain the spread of the
pandemic and protect lives. In the particular case of the United States, each state has set
forth strategies to contain the spread of the disease and to reduce the number of deaths.
Project Description
You are tasked with determining whether or not there exists a difference in COVID-19
incidences across two US states of your choice using COVID-19 data, namely Total
Cases and Total Deaths, and US population data by state.
To complete your project, you will use secondary data, the CDC dataset COVID-19 Cases and
Deaths by State over Time for 2020 (https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data), to estimate the difference in
COVID-19 incidences across two states. The Excel file for this dataset is attached. You
will also have to test the hypothesis of no difference in COVID-19 incidences across the
two states.
Steps for conducting the statistical analysis are described below.
1. Data collection and visualization
The dataset on COVID-19 Total Cases and Total Deaths by state in 2020 and on
the US population by state in 2020 is attached. For each of your selected states,
select a simple random sample whose size is one third of the total number of
observations. If one third of the observations is less than 30, increase the sample
size to 30 by randomly selecting additional observations. Next, generate the
COVID-19 Total Cases per 100,000 people (C19TCP100T) and the COVID-19
Proportion of Total Deaths in Total Cases (C19PTDITC).
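The sampling rule above can be sketched in Python for readers who want to check their Excel or SPSS selection. This is only an illustration, not the required tool: the function names are hypothetical, rounding the third up to a whole number is an assumption the handout does not state, and the 60 placeholder records stand in for one state's rows.

```python
import math
import random

def sample_size(n_observations, minimum=30):
    """One third of the observations, raised to `minimum` if the third is smaller.

    Assumption: a fractional third is rounded up.
    """
    third = math.ceil(n_observations / 3)
    # A sample drawn without replacement cannot exceed the available rows.
    return min(n_observations, max(third, minimum))

def draw_simple_random_sample(observations, seed=None):
    """Draw a simple random sample (without replacement) of the required size."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(observations, sample_size(len(observations)))

# Placeholder data: 60 records for one state, labelled 0..59.
records = list(range(60))
sample = draw_simple_random_sample(records, seed=1)
print(len(sample))  # 30: one third of 60 is 20, which is below the minimum of 30
```

Recording the seed in the methodology section lets a reader reproduce exactly the same sample.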
To generate the C19TCP100T for each state, first express the population in
units of 100,000 by dividing the population of the state by 100,000. Then divide
Total Cases by the population in units of 100,000 to obtain the C19TCP100T for
each state.
To generate the C19PTDITC for each state, divide the Total Deaths for the state by
the Total Cases for the same state and multiply the result by 100 to express it as a
percent.
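As a quick check of the two definitions, the calculations can be written as a short Python sketch. The function names and the figures in the example are hypothetical and are not taken from the CDC dataset.

```python
def cases_per_100k(total_cases, population):
    """C19TCP100T: Total Cases divided by the population in units of 100,000."""
    population_in_100k = population / 100_000
    return total_cases / population_in_100k

def deaths_pct_of_cases(total_deaths, total_cases):
    """C19PTDITC: Total Deaths as a percent of Total Cases."""
    return total_deaths / total_cases * 100

# Made-up figures for one observation of one state.
c19tcp100t = cases_per_100k(total_cases=50_000, population=20_000_000)
c19ptditc = deaths_pct_of_cases(total_deaths=1_000, total_cases=50_000)
print(c19tcp100t)  # 250.0 cases per 100,000 people
print(c19ptditc)   # 2.0 percent of cases resulted in death
```

Observations with zero Total Cases would make the second ratio undefined, so such rows need to be handled during data cleaning.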
Next, plot the C19TCP100T for the two samples in the same chart (visualization)
to detect whether or not there are differences in Total Cases per 100,000 people.
Do the same for the C19PTDITC. The visualizations should be produced in
Excel or SPSS.
2. Estimation of the mean, variance and standard deviation for each of the two
COVID-19 variables
This step consists of estimating the mean, variance and standard deviation of
each of the two COVID-19 variables, that is, C19TCP100T and C19PTDITC,
for each state. Those statistics can be generated in Excel using Data => Data
Analysis => Descriptive Statistics (select Summary Statistics).
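The same three statistics can be cross-checked outside Excel with Python's standard `statistics` module, as in the sketch below. The helper name `describe` and the eight sample values are hypothetical. The variance and standard deviation here are the sample versions (divisor n - 1), which matches the Excel summary output.

```python
import statistics

def describe(values):
    """Return sample size, mean, sample variance and sample standard deviation."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # divisor n - 1
        "std_dev": statistics.stdev(values),      # square root of the sample variance
    }

# Made-up C19PTDITC values (in percent) for one state's sample.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = describe(values)
print(summary["mean"])                # 5.0
print(round(summary["variance"], 2))  # 4.57 (that is, 32 / 7)
print(round(summary["std_dev"], 2))   # 2.14
```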
3. Point estimation and interval estimation of C19TCP100T differentials and of
C19PTDITC differentials across the two states
The estimates of the mean C19TCP100T, their standard deviations, and their
sample sizes are the inputs needed to calculate the point estimate and the
interval estimate of the C19TCP100T differential (use the confidence level of your
choice, preferably between 95% and 99%). Likewise, the estimates of the mean
C19PTDITC, their standard deviations, and their sample sizes are the inputs
needed to calculate the point estimate and the interval estimate of the C19PTDITC
differential (use the confidence level of your choice, preferably between 95%
and 99%). If the sample size for each state is 30 or more, assume that the sample
standard deviation equals the population standard deviation and
use the Z distribution to construct the confidence interval. If the sample size
is less than 30, use the t distribution to construct the confidence
interval.
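For the large-sample case described above, the interval is (mean1 - mean2) ± z × sqrt(s1²/n1 + s2²/n2). The sketch below is one way to check a hand or Excel calculation. The function name and the summary figures are hypothetical. Python's standard library has no t quantile function, so for samples below 30 the sketch assumes the t critical value is read from a table and passed in through the `critical` argument.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(mean1, sd1, n1, mean2, sd2, n2,
                             confidence=0.95, critical=None):
    """Point estimate, margin of error and interval for mean1 - mean2.

    If `critical` is None, the Z critical value is used (samples of 30 or more).
    """
    if critical is None:
        alpha = 1 - confidence
        critical = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. about 1.96 for 95%
    point = mean1 - mean2
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    margin = critical * standard_error
    return point, margin, (point - margin, point + margin)

# Made-up summary statistics for C19TCP100T in two states.
point, margin, (lower, upper) = diff_confidence_interval(
    mean1=250.0, sd1=40.0, n1=100,
    mean2=230.0, sd2=30.0, n2=100,
)
print(point)                            # 20.0
print(round(margin, 2))                 # 9.8
print(round(lower, 2), round(upper, 2))  # 10.2 29.8
```

In this made-up example the interval excludes zero, which would point to a difference between the two states at the 95% confidence level.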
Next, reduce the margin of error by 75% and calculate the sample size needed to
achieve that target. Finally, reconstruct the confidence interval for the
C19TCP100T differential that would result from such a sample. Repeat the
same procedure for the C19PTDITC differential.
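Reducing the margin of error by 75% means the new margin is 25% of the current one. Assuming both states use the same sample size n (an assumption the handout does not state) and the Z distribution applies, solving E = z × sqrt((s1² + s2²)/n) for n gives n = z²(s1² + s2²)/E². The sketch below uses a hypothetical function name and made-up figures.

```python
import math
from statistics import NormalDist

def required_n_per_state(sd1, sd2, target_margin, confidence=0.95):
    """Smallest equal per-state sample size whose Z margin of error for the
    difference in means does not exceed `target_margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * (sd1**2 + sd2**2) / target_margin**2)

# Made-up inputs: current margin of error 10.0, standard deviations 40 and 30.
current_margin = 10.0
target_margin = 0.25 * current_margin  # a 75% reduction leaves 25% of the margin
n_needed = required_n_per_state(sd1=40.0, sd2=30.0, target_margin=target_margin)
print(target_margin)  # 2.5
print(n_needed)       # 1537 observations per state
```

Because the margin of error shrinks with the square root of n, cutting it to one quarter requires roughly sixteen times as many observations.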
4. Hypothesis testing of the non-existence of COVID-19 Incidences differentials
In this step, the hypothesis testing procedure will be implemented to test the
non-existence of COVID-19 incidences differentials for each of the two variables. The
hypothesis of non-existence of COVID-19 incidences differentials will be tested
against the alternative hypothesis of existence of COVID-19 incidences
differentials. This step is crucial since it helps to determine whether or not the
observed estimate of the COVID-19 incidences differential is due to
random error. Choose a confidence level between 95% and 99% to conduct
your hypothesis test. Also, follow the guidelines given in point 3 to
determine the type of distribution to be used in hypothesis testing. The hypothesis
testing procedure is summarized below.
– Determine the null and alternative hypotheses.
– Choose the level of significance (preferably, set α = 0.05).
– Validate the assumptions of the hypothesis test, identify the appropriate test
statistic, and compute its value (alternatively, compute the P-value).
– Use the graphs to determine whether you should conduct a two-sample test of
the means with equal or unequal variances.
– Compare the value of your statistic to the critical value from the statistical
tables (alternatively, compare the P-value to the level of significance α).
– Make a decision to reject or fail to reject the null hypothesis.
– State the conclusion.
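The steps above can be sketched for the large-sample case as a two-sided, two-sample Z test of H0: μ1 - μ2 = 0 against H1: μ1 - μ2 ≠ 0. The function name and summary figures are hypothetical. The sketch treats the sample standard deviations as population values, as point 3 allows for samples of 30 or more, and it does not cover the t test needed for smaller samples.

```python
import math
from statistics import NormalDist

def two_sample_z_test(mean1, sd1, n1, mean2, sd2, n2, alpha=0.05):
    """Two-sided Z test of no difference between two population means."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / standard_error             # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided P-value
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    reject_h0 = abs(z) > z_critical                  # decision rule
    return z, p_value, z_critical, reject_h0

# Made-up summary statistics for C19TCP100T in two states.
z, p_value, z_critical, reject_h0 = two_sample_z_test(
    mean1=250.0, sd1=40.0, n1=100,
    mean2=230.0, sd2=30.0, n2=100,
)
print(z)                     # 4.0
print(round(z_critical, 2))  # 1.96
print(p_value < 0.05)        # True
print(reject_h0)             # True: reject the hypothesis of no difference
```

Comparing |z| with the critical value and comparing the P-value with α always lead to the same decision, so either route in the procedure may be reported.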
5. Interpretation of results
Describe the meaning of your results and how they can be used for policy
recommendations.
– This project will be graded out of 100 points and will contribute 10% to your final
grade in this course.
– The key success factor for this project is to use the correct and cleaned data and
demonstrate a systematic approach to data analysis by using the appropriate tools.
– This project should be completed in Excel or SPSS. There is a free version of SPSS
available for STAT 101 on the IBM Cognitive Class site (the link to the course is
https://cognitiveclass.ai/courses/statistics-101/).
– The final report of your project must be typed, with line spacing set to multiple at
1.15, and must contain an introduction, a section describing your methodology, a data
analysis section, and a conclusion section that summarizes the results of your
analysis. The formulas used should be shown in detail, and the calculations shown
clearly. All cited work and sources of information must be listed in the reference
list.
– You should each keep a log of what you have been assigned to do and what you
have accomplished.
– The project will be evaluated by me and you will receive a discounted grade if there
are significant discrepancies.
– The assessment rubric is attached.
Format
Each project will be 5 pages maximum (appendix not included) and must be written using
the following guidelines and contents:
– Title page (Include project title and your name)
– Introduction: Problem of the proposed study, purpose and justification of the study
– Methodology
– Data Collection and Cleaning
– Data analysis
– Interpretation of results
– Findings and conclusion.
– Appendices: Tables, Figures.
– References
Font must be Times New Roman (or Calibri) and font size must be 12. The line spacing
must be multiple at 1.15. The spacing before must be 6 pt and the spacing after must be
6 pt.
States Selected: California & Florida