    # Statistics

Have you heard the latest statistics joke? Probably! # LibreOffice Calc

LibreOffice Calc is the free spreadsheet program of the LibreOffice suite. It comes with many functions, including those for imaginary numbers, as well as financial and statistical functions. It is very intuitive and easy to learn.

Let’s work with a spreadsheet that contains data about average life expectancy by country.

Life Expectancy
Country Code Life Expectancy
Aruba ABW 76.293
Afghanistan AFG 64.833
Angola AGO 61.147
Albania ALB 78.573
United Arab Emirates ARE 77.972
Argentina ARG 76.667
Armenia ARM 75.087
Antigua, Barbuda ATG 77.016
Australia AUS 82.9
Austria AUT 81.79…

We are going to analyze the data in our spreadsheet using Calc’s functions. A function is a predefined calculation or formula entered in a cell to help you analyze or manipulate data in a spreadsheet.

You can insert functions in Calc either using the Function Wizard or manually:

• Select the cell that will contain the formula (C267).
• Click the fx button on the Formula bar (alternatively select the Function option in the Insert menu or type the shortcut Ctrl+ F2) to open the Function Wizard.
• Select a category of functions (Statistical) to shorten the list, then scroll down through the functions and select the required one (AVERAGE) by double-clicking on it. When you select a function, its structure (AVERAGE( Number 1, Number2,…) and description (Returns the average of a sample) show up on the right-hand side of the dialog.
• Press the Shrink button to shrink the wizard and select the cells from the spreadsheet using the mouse where your current argument is located or type the argument directly into the formula. • Click OK to exit the wizard and the result will appear: 72.760… • Alternatively you can insert functions manually. Select the cell where the result will appear and type = (the equal sign), the function name (AVERAGE), and its arguments (they can refer to both individual cells and cell ranges and must be enclosed within parentheses), e.g., =AVERAGE(C4:C265).

Averages are used to represent a large set of numbers with a single number. It refers to the sum of a group of values divided by n, where n is the number of values in the group. The “median” is the “middle” value in the list of numbers, in other words, the value separating the higher half from the lower half of the list: =MEDIAN(C4:C266). The mode is the value that appears most frequently in the list: =MODE(C4:C266).

You may want to use the function ROUND (ROUND(2.43) returns 2, ROUND(2.65) returns 3) to round the life expectancy values to a certain precision before using the MODE function.

1. Select the cell at the right of the first value (life expectancy of Aruba), BK6, and enter the formula: =ROUND(BJ6).
2. Select this cell and hover the mouse cursor over a small square at the lower right-hand corner of the cell. As you do this, the cursor will change to a black cross. Hold and drag the black cross down the column over the cells where you want to copy the formula.
3. Finally, select a cell and define the function over the new column: =MODE(BK6:BK271). Variance is the average of the squared differences from the mean, it measures how far a set of numbers is spread out from their average value: =VAR(C4:C266). The Standard Deviation is the square root of the Variance, a measure of the amount of variation or dispersion of a set of values: =STDEV(C4:C266).

# Statistics with R

R is a free programming language and statistical software for analyzing and visualizing data.

You could download it and install it from The Comprehensive R Archive Network. You should install RStudio, too (sudo apt-get install gdebi-core && wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-1.4.1717-amd64.deb && sudo gdebi rstudio-server-1.4.1717-amd64.deb). It runs on port 8787 and accepts connections from all remote clients. After installation, open a web browser and access the server by typing the following address: http://server-ip:8787 • You can create a vector representing academic results by typing: results <- c(2, 1, 8, 5, 4, 6, 9, 2, 3, 4, 3, 6, 7, 10)
• mean(results) calculates the sample mean. median(results) computes the sample median. var(results) and sd(results) return the variance and standard deviation.
• R’s var() function is actually computing what is sometimes called the quasi-variance. To calculate the variance, you have to multiply this number by (N–1)/N: varianza <- function(x){var(x)*(length(x)-1)/length(x)} We are defining a function and length(x) is the number of elements in x, eg., length(results) = 14. varianza(results) is what we are looking for.
• plot(results), plot is the generic function for plotting of R objects.
• You may also want to compute: max and min select the largest and smallest elements of results (max(results) = 10, min(results) = 1); sum(results) gives the total of the elements in results (=70); sort(results) returns a vector of the same size as results with the elements arranged from smallest to largest 1 2 2 3 3 4 4 5 6 6 7 8 9 10.
• Let’s repeat the same calculations with a second vector: results <- c(4, 6, 7, 5, 4, 6, 7, 4, 5, 4, 6, 6, 7, 4). varianza(results) = 1.372449 < 7.142857. It means that these academic results are less dispersed and spread out or, in other words, their average 5.357143 represents much better the data set.

I would like to say something about the old joke, “A statistician can have his head in the oven and his feet in the freezer, and he will say that on average he feels fine.”

``````>>> x <- c(50, 0) # Let's say that the head in the oven is at 50°C and his feet in the freezer are at 0°C.
>>> mean(x)
 25 # The average temperature may be comfortable...
>>> varianza(x)
 625 # ... but, there is always a but in this world, the variance is very high. It indicates that the data points are very spread out from the mean, and from one another.
``````

In 1992 Mackowiak, Wasserman, and Levine measured the body temperatures of 65 men and 65 women: oral (33.2 – 38.2°C, avg(33.2, 38.2) = 35.7), rectal (please, let’s not go into details, 34.4 –37.8°C, avg(34.4, 37.8) = 36.1), tympanic (ear canal, 35.4 – 37.8°C, avg(35.4, 37.8) = 36.6), axillary (armpit, 35.5 –37.0°C, avg(35.5, 37.0) = 36.25).

``````>>> x <- c(35.7, 36.1, 36.6, 36.25)
>>> mean(x)
 36.1625
>>> varianza(x)
 0.1042187 # A small variance indicates that _the data points are very close to the mean, and to each other. All of the data values are almost identical.
``````

Finally, we are going to illustrate a concept with an example:

``````>>> results <- c(4, 6, 7, 5, 4, 6, 7, 4, 5, 4, 6, 6, 7, 4)
>>> mean(results)
 5.357143
>>> median(results)
 5.5
>>> results <- c(4, 6, 7, 5, 4, 6, 37, 4, 5, 4, 6, 6, 7, 4) # An outlier (37, an outlier is a data point that differs significantly from other observations) has been introduced into the data set.
>>> mean(results)
 7.5 # The 37 is an outlier, which distorts the mean, therefore the median is a better option.
>>> median(results)
 5.5
``````

# Introduction to Plotting in R

Let’s use the following survey: Public opinion on the most important problems facing the U.S. 2021.

“20 percent of respondents stated that poor leadership and a general dissatisfaction with the government were the most important problems facing the U.S. Furthermore, another 25 percent of respondents said that the most important problem facing the U.S. was the coronavirus pandemic,” published by Statista Research Department, Mar 29, 2021. Besides, 8% Race relations/Racism, 8% Economy in general, 8% Immigration, 31 Others (Unemployment, Healthcare, debt, etc.).

We are going to define two vectors: problems <- c(25, 20, 8, 8, 8, 31) (problems facing the US) and labels <- c(“Covid”, “Government”, “Racism”, “Economy”, “Immigration”, “Others”) (labels or descriptions).

Next, we will create a barplot: barplot(problems, main=“Most important problems facing the US”, ylab=“Percentage of people”, names.arg=labels, col=rainbow(6)), where problems determine the heights of the bars in the plot. We include the option names.args=labels to label the bars, add a title (main=“Most important problems facing the US”) and a label for the y-axis (ylab=“Percentage of people”), and use rainbow colors (col=rainbow(6)). We are going to try to make the labels more descriptive by adding the percentage value: newlabels <- paste(labels, “”, problems, “%”)

Let’s ask R to draw a pie: pie(problems, main=“Most important problems facing the US”, labels=newlabels, col=rainbow(6)). Pie charts are created with the function pie(x, labels=names) where x is a vector indicating the area of each slice and labels is a character vector of names for the slices. Surely we can do better than that! We are going to install two packages:

• The viridis R package by Simon Garnier. It provides color palettes to make beautiful plots: install.packages(“viridis”).
• The plotrix package. It provides 3D exploded pie charts: install.packages( “plotrix” ).
``````library(plotrix) # Load the plotrix package
library(viridis) # Load the viridis package
problems <- c(25, 20, 8, 8, 8, 31)
labels <- c("Covid", "Government", "Racism", "Economy", "Immigration", "Others")
newlabels <- paste(labels, "", problems, "%")
_pie3D(problems, main="Most important problems facing the US", labels=newlabels, col = viridis(6), explode = 0.3)_
``````

pie3D is the function in the plotrix package that provides 3d pie charts. We use the function viridis() in the viridis package to generate the colors, too. Box Plot in R. x=rnorm(500, mean=0, sd=1). rnorm(n, mean, sd) generates a vector of normally distributed random numbers.

boxplot(x) draws a boxplot for x. A boxplot is a standardized way of displaying the distribution of data based on a five-number summary: minimum (Q0, the lowest data point excluding any outliers), first quartile (Q1), median, third quartile (Q3), and maximum (Q4, the largest data point excluding any outliers). A boxplot is constructed of two parts: a box drawn from Q1 to Q3 with a horizontal line drawn in the middle to denote the median and a set of whiskers. There is a line like a whisker connecting the first quartile to the minimum and a second line from the third quartile to the maximum.

Anything outside the whiskers is considered as an outlier (dots). Let’s create a stem-and-leaf plot with R. It is basically a special table where each data value is split into a “leaf” (the last digit) and a “stem” (all the other digits). It is drawn with two columns separated by a vertical line. The stems are listed to the left of the vertical line. The leaves are listed in increasing order in a row to the right of each stem.

``````edades <- c(35, 36, 36, 35, 57, 69, 24, 25, 21, 39, 37, 30, 29, 28, 34, 40, 20) # edades is a vector representing ages from a given population.
`````` Finally, we are going to visualize some data about the evolution of unemployment in some countries using time plots. Our input data is in csv format with commas as delimiters:

“Country”, “Year”, “Unemployment”
“AUS”,“2005”,5.033881
“AUS”,“2006”,4.78524
“AUS”,“2007”,4.379151
“AUS”,“2008”,4.23433
“AUS”,“2009”,5.560385 ….

``````unemployment <- read.table("/path/Unemployment.csv", header = T, sep = ",")
``````

Reads a file in table format and creates a data frame from it; header = T indicates that the file contains the names of the variables as its first line. sep is the field separator character, values on each line of the file are separated by commas.

``````unemploymentAUS <- subset(unemployment, Country =="AUS", select = c("Year", "Unemployment"))
``````

The subset() function is the easiest way to select variables and observations. We select all rows where “Country” is AUS (AUS is the three-letter country abbreviation for Australia) and keep the Year and Unemployment columns.

``````unemploymentES <- subset(unemployment, Country =="ESP", select = c("Year", "Unemployment"))
unemploymentGBR <- subset(unemployment, Country =="GBR", select = c("Year", "Unemployment"))
unemploymentUSA <- subset(unemployment, Country =="USA", select = c("Year", "Unemployment"))
unemploymentFRA <- subset(unemployment, Country =="FRA", select = c("Year", "Unemployment"))
unemploymentGRC <- subset(unemployment, Country =="GRC", select = c("Year", "Unemployment"))
plot(unemploymentES, col = "violet", xlab = "Time", type = "o", ylim = range(unemployment["Unemployment"]))
# We plot unemployment rate evolution in Spain (x = Year, y = Unemployment), specify the x-axis label (xlab = "Time"), set the limits of the y-axis (ylim = range(unemployment["Unemployment"])) and the type of plot that gets drawn (it will display segments and points, but with the line overplotted).
title("Unemployment evolution") # Define a title of the plot.
lines(unemploymentAUS, col="yellow", type="o") # lines() can not produce a plot on its own. However, it can be used to add lines() on an existing graph. It plots unemployment rate evolution in Australia.
lines(unemploymentGBR, col="red", type="o")
lines(unemploymentUSA, col="blue", type="o")
lines(unemploymentFRA, col="orange", type="o")
lines(unemploymentGRC, col="green", type="o")
legend("topleft", c("ES","AUS", "GBR", "USA", "FRA", "GRC"), lwd=c(5,2), col=c("violet","yellow", "red", "blue", "orange", "green"), y.intersp = 1) # The legend() function is used to add legends to plots. y.intersp is character interspacing factor for vertical (y) spacing. lwd=c(5,2) is used to define the line widths for lines appearing in the legend (other ideas: lwd=c(1), lwd=c(1,2,1,2,1,2), etc.).
`````` # Probability in R

We are going to use the prob package. It is a framework for performing elementary probability calculations on finite sample spaces. Let’s install it: install.packages(“prob”, repos=“http://R-Forge.R-project.org”).

R does not need to be installed. You can also open your favourite browser and open this location: https://rdrr.io/, a comprehensive index of R packages and documentation from CRAN. It lets you run any R code through your browser. ``````library(prob) # Load the library prob.
tosscoin(2, makespace=TRUE) # Sets up a sample space for the experiment of tossing a coin two times with the outcomes "H" (heads) or "T" (tails).
S<-tosscoin(2, makespace=TRUE)
Prob(S, toss1=="H" & toss2=="H")
``````

If we consider all possible outcomes, there was only one (HH, HT, TH, TT) out of four trials in which both coins landed on heads, so the probability of getting heads on both coins is: Prob(S, toss1==“H” & toss2==“H”) = Number of favorable outcomes⁄total number of possible outcomes = 14 = 0.25.

``````rolldie(1, makespace=TRUE)
``````

Sets up a sample space for the experiment of rolling a die once. Since there are six possible outcomes, the probability of obtaining any side of the die is = Number of favorable outcomes⁄total number of possible outcomes = 16 = 0.166… The die is thrown twice. What is the probability of getting a sum of 7?

``````library(prob)
S<-rolldie(2, makespace=TRUE)
Prob(S, X1+X2 == 7)
``````

Prob(S, X1+X2 == 7) = Number of favorable outcomes⁄total number of possible outcomes = 6 (1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3) ⁄ 36 = 636 = 0.166…

A die is thrown twice, what is the probability of getting a doublet? Prob(S, X1==X2) = Number of favorable outcomes⁄total number of possible outcomes = 6 (1, 1),(2, 2), (3, 3), (4, 4), (5,5 ), (6, 6) ⁄ 36 = 636 = 0.166…

A die is thrown twice, what is the probability of getting two 6’s?

When some events do not affect one another, they are known as independent events. Examples of independent events include rolling a die more than once, flipping a coin multiple times, and spinning a spinner. If A and B are independent events, P(A ∩ B) = P(A) P(B).

Prob(S, X1==6 & X2==6) = Number of favorable outcomes⁄total number of possible outcomes = 1 (6, 6) ⁄ 36 = 136 = (The events of rolling a die multiple times are independent) = P(X1==6) * P(X2==6) = 16 * 16 = 136 = 0.02777777777.

What is the probability of getting a doublet given that the sum of two dice is 8?

Prob(S, X1==X2, given = (X1+X2==8))

Conditional probability is the probability of an event occurring given that another event has already occurred. The conditional probability of A given B (it is usually written as p(A/B))= P(A ∩ B)P(B)

P(A/B) = P(A ∩ B)P(B) = 136 (X1=4, X2=4)⁄ 536 ( (2, 6), (6, 2), (3, 5), (5, 3), (4, 4) ) = 15 = 0.2.

What is the probability that the sum of two dice is 8 given that a doublet has been seen?

Prob(S, X1+X2==8, given = (X1==X2))

P(B/A) = P(A ∩ B)P(A) = 136 (X1=4, X2=4)⁄ 636 ( (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6) ) = 16 = 0.166…

Prob(S, X1==6, given = (X1%2==0))

P(A/B) = P(A ∩ B)P(B) = 16 (X1=6)⁄ 36 ( X1 = 2, X1 = 4, X1 = 6 ) = 13 = 0.333…

Prob(S, X1==6, given = ( (X2%2==0) & (X1+X2>8) ) )

P(A/B) =P(A ∩ B)P(B) = 236 ((6, 4), (6, 6))⁄ 636 ( (5, 4), (6, 4), (3, 6), (4, 6), (5, 6), (6, 6) = 26 = 0.333…

Prob(S, X1==6, given = ( (X2%2!=0) & (X1+X2>8) ) )

P(A/B) = P(A ∩ B)P(B) = 236 ( (6, 3), (6, 5) ) ⁄ 436 ( (6, 3), (4, 5), (5, 5), (6, 5)) = 24 = 0.5. # A Gentle Introduction to Bayes Theorem

Let’s illustrate it with an example. There are three kinds of students: procrastinators (0.7), academically, average students (0.2), and over achievers (0.1): prior<- c(0.7, 0.2, 0.1).

The probability of a student passing an exam varies a lot between groups: 0.2 (procrastinators), 0.8 (average students), or 0.9 (over achievers): posteriori<- c(0.2, 0.8, 0.9).

The Bayes’ theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is defined as P(A/B) = P(B/A) * P(A)P(B).

A student has passed his or her last exam but has forgotten to sign on the answer sheet. What is the probability of this student belonging to each of the groups? p(B) = probability that a student passes an exam, p(B) = probability a student fails an exam. P(A1/B) = probability that a student is a procrastinator given that he has passed an exam. P(A1/B) = P(B/A1) * P(A1)⁄P(B) = 0.2 * 0.7 ⁄ P(B) = …

P(B) = { A1 ∪ A2 ∪ A3 = S, the sample space} P(B ∩ A1) + P(B ∩ A2) + P(B ∩ A3) = P(B/A1) * P(A1) + P(B/A2) * P(A2) + P(B/A3) * P(A3) = 0.2 * 0.7 + 0.8 * 0.2 + 0.9 * 0.1 = 0.39

P(A1/B) = 0.2 * 0.70.39 = 0.3589.
P(A2/B) = P(B/A2) * P(A2) ⁄ P(B) = 0.8 * 0.20.39 = 0.4102.
P(A3/B) = P(B/A3) * P(A3) ⁄ P(B) = 0.9 * 0.20.39 = 0.2307.

Let’s do these calculations in R: Bitcoin donation 