###########################################################

### BACKGROUND / DESCRIPTION ###
#The purpose of this project is to educate the reader on different principles and properties of the data curation pipeline. This will include Data Curation, Parsing, Hypothesis Testing, and Predictive Analysis. Throughout this tutorial we will construct and analyze a dataset in order to understand and apply these properties.

# In order to analyze data, we first need to get some. For this tutorial we will be looking at a dataset of the fastest single times to solve a Rubik's Cube achieved in a regulated competition. Competitive Rubik's Cube solving, also known as speedcubing, is as simple as it sounds: solve a Rubik's Cube in the fastest time possible. I am a speedcuber, and I thought it would be interesting to take data from the top 300 single solves and see if I could find some patterns within the data. 

##NOTE##
#The times in this data all fall within a couple of seconds of each other. As such, the trends that we discover may only show increases/decreases as small as a second or even less. Ordinarily differences that small would not be significant, but because these times are all so close and so fast, intervals of less than a second do matter here. I'll stress this again as the analysis goes on, but I wanted to address it before any analysis takes place so you, the reader, are aware. And with that, let's start looking at some speedcubing data!


########################
### PARSING THE HTML ###
########################

#In this section we connect to the webpage, pull all the data we want out of the HTML, and tidy it up a bit
#First, load the libraries used throughout this tutorial
library(rvest)     # read_html(), html_nodes(), html_text()
library(tidyverse) # dplyr, ggplot2, stringr, tibble
library(stringi)   # stri_split_fixed()
library(broom)     # tidy() for model summaries
library(cowplot)   # plot_grid()
url = 'https://www.worldcubeassociation.org/results/e.php?eventId=333&regionId=&years=&show=1000%2BPersons&single=Single' #the URL 
webpage = read_html(url)
page_html <- html_nodes(webpage,'tr')
competitors <- html_text(page_html)
competitors <- competitors[-c(1,2,3,4)] #remove unnecessary <tr>s
competitors <- competitors[c(1:132,134:145,147:177,179:181,183:203,205,207:215,217:229,231:300)]

#several of the rows were malformed, and I tried various methods of cleaning or repairing them, but none worked well. So in the end I decided that the best thing to do would be to remove them from this analysis. I looked through each removed entry, and none of them severely affect the analysis.
# The entries that were removed were: 133, 146, 178, 182, 204, 206, 216, and 230. 
competitors <- str_replace_all(competitors, "Ø", "O") # fixing some chars to help with parsing
competitors <- str_replace_all(competitors, "ø", "o")

###############
### TIDYING ###
###############

# After getting the text of each <tr>, it's difficult to break it up into its sections because of how it is formatted. So, in the following code blocks, I will insert the marker "--" between each section of the string and then later split on that "--" in order to recover each section.
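# (A toy illustration of this marker-and-split idea, on a made-up string that is simpler than the real rows;
# the actual regexes used below are a bit more involved.)
toy <- "1Feliks Zemdegs4.22"
toy <- gsub("(\\d)([A-Z])", "\\1--\\2", toy)          # "1--Feliks Zemdegs4.22"
toy <- gsub("([a-z])(\\d\\.\\d\\d)", "\\1--\\2", toy) # "1--Feliks Zemdegs--4.22"
strsplit(toy, "--")                                   # a list holding c("1", "Feliks Zemdegs", "4.22")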

### FIX MISSING DATA ###
#Some data is missing. Specifically, some entries don't have a rank number at the front (because they are tied in the database).
#For those entries we add a rank number and the "--" marker to the front.
i <- 1
for(string in competitors){
  if(!grepl("^[1-9]", string)){ #if the string does not already start with a rank number
    n <- paste(i, "--") #build "rank --" using the entry's position as its rank
    competitors[i] <- paste(n, competitors[i]) #prepend it to the string
  }
  i <- i + 1
}
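# (For reference, the same fix can be written without a loop; this is just a sketch of the equivalent
# vectorized form, and running it after the loop above is a harmless no-op since every entry now starts with a digit.)
no_rank <- !grepl("^[1-9]", competitors)
competitors[no_rank] <- paste(which(no_rank), "--", competitors[no_rank])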

### REGEXES TO SPLIT THE DATA ###
#Next, we're going to use some regexes to split the data further, adding more "--" markers to divide it into sections

competitors <- competitors %>% 
  gsub("\\(.*\\)","",.) %>% #remove (...) that exist in the data
  gsub("([A-z | //)])(\\d\\.\\d\\d)", '\\1--\\2', .) %>% #used to split the time from the name 
  gsub("(\\d)([A-z])", '\\1--\\2', .) #also used to split the time from the name

#the person at rank 25 had an unusual character in their name that the regexes didn't handle, so I split that entry separately here
rep <- "25--"
competitors[25] <- paste(rep, gsub('25','',competitors[25]))
competitors <- strsplit(competitors, "--")

### CREATING CHARACTER VECTORS FOR THE TABLE ###
#Next, we want to make each of the columns of the table. We can use a for loop to put the parts of each split string into a character vector that will become a column

rankC <- character()
personC <- character()
resultC <- character()
restC <- character()

for(p in competitors){
  rankC <- c(rankC, p[1])
  personC <- c(personC, p[2])
  resultC <- c(resultC, p[3])
  restC <- c(restC, p[4])
}
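# (Equivalent sketch without growing vectors in a loop: sapply() with the `[` extractor pulls the k-th
# piece out of every split entry. The *_alt names are new and only here for comparison.)
rankC_alt   <- sapply(competitors, `[`, 1)
personC_alt <- sapply(competitors, `[`, 2)
resultC_alt <- sapply(competitors, `[`, 3)
restC_alt   <- sapply(competitors, `[`, 4)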

### DEALING WITH RESTC ###
#The 'Citizen of' and 'Competition' parts of each row were too hard to parse in the previous section due to how the <tr> formatted them, so I parse them here instead.

restC <- restC %>% #here I add the marker between the two sections
    gsub("([a-z])([A-Z]+)",'\\1--\\2',.)

restC <- stri_split_fixed(str = restC, pattern = "--", n = 2) #then I split the string on the pattern "--"
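# (A toy illustration, on a made-up string, of the lowercase-to-uppercase boundary trick used above.)
toy_rest <- "United StatesRally In The Valley 2017"
gsub("([a-z])([A-Z]+)", "\\1--\\2", toy_rest) # "United States--Rally In The Valley 2017"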

citizenC <- character()
competitionC <- character()

for(c in restC){ #lastly I go through and separate the strings into the 'Citizen of' and 'Competition' sections
  i <- 0
  for(j in c){
     if(i == 0){ #the first piece is the country
       citizenC <- c(citizenC, j)
       i <- 1
     } else { #the second piece is the competition; strip any leftover "--" markers
       competitionC <- c(competitionC, gsub("--", "", j))
     }
  }
}


### MAKE THE DATA FRAME ###
#Lastly, we take all the vectors we made and put them into a data frame to make the table


df <- data.frame(rankC,personC,resultC,citizenC,competitionC,stringsAsFactors=FALSE)
names(df) <- c("Rank","Person","Result","Citizen of", "Competition")
df$Result <- as.double(as.character(df$Result)) #change the type of result to double
df$Rank <- as.integer(as.character(df$Rank)) #change the type of rank to integer
as.tibble(df)
## # A tibble: 292 x 5
##     Rank Person            Result `Citizen of`      Competition           
##    <int> <chr>              <dbl> <chr>             <chr>                 
##  1     1 Feliks Zemdegs      4.22 Australia         Cube for Cambodia 201~
##  2     2 "SeungBeom Cho "    4.59 Republic of Korea ChicaGhosts 2017      
##  3     3 Patrick Ponce       4.69 United States     Rally In The Valley 2~
##  4     4 Mats Valk           4.74 Netherlands       Jawa Timur Open 2016  
##  5     5 Bill Wang           4.76 Canada            Pickering Spring 2018 
##  6     6 "  Drew Brads"      4.76 United States     Bluegrass Spring 2017 
##  7     7 Max Park            4.78 United States     Skillcon 2017         
##  8     8 Blake Thompson      4.86 United States     Queen City 2017       
##  9     9 Antonie Paterakis   4.89 Greece            The Hague Open 2017   
## 10    10 Lucas Etter         4.90 United States     River Hill Fall 2015  
## # ... with 282 more rows
### ADDING A YEAR COLUMN ###

#Now that we have this table, let's take a look at the data. We can use different graphs to see the relationships between different parts of the data

# There are two main ways we will start looking at the data: how people's times relate to the year in which their solve happened, and how people's solves relate to each other

#The first thing we can look at is all the times across the years, to see if there is a trend of people getting faster. To do this we must first make a new column, Year.

df["Year"] <- str_extract(df$Competition,"\\d{4}") #get the year, make a new column
df$Year <- as.integer(as.character(df$Year)) #change the type of year to integer
df[136,]$Year <- "2017" # this year got missed while scraping, so I add it back in manually (note: assigning a character value here coerces the whole Year column back to character; we convert it to numeric again later)
as.tibble(df)
## # A tibble: 292 x 6
##     Rank Person            Result `Citizen of`      Competition      Year 
##    <int> <chr>              <dbl> <chr>             <chr>            <chr>
##  1     1 Feliks Zemdegs      4.22 Australia         Cube for Cambod~ 2018 
##  2     2 "SeungBeom Cho "    4.59 Republic of Korea ChicaGhosts 201~ 2017 
##  3     3 Patrick Ponce       4.69 United States     Rally In The Va~ 2017 
##  4     4 Mats Valk           4.74 Netherlands       Jawa Timur Open~ 2016 
##  5     5 Bill Wang           4.76 Canada            Pickering Sprin~ 2018 
##  6     6 "  Drew Brads"      4.76 United States     Bluegrass Sprin~ 2017 
##  7     7 Max Park            4.78 United States     Skillcon 2017    2017 
##  8     8 Blake Thompson      4.86 United States     Queen City 2017  2017 
##  9     9 Antonie Paterakis   4.89 Greece            The Hague Open ~ 2017 
## 10    10 Lucas Etter         4.90 United States     River Hill Fall~ 2015 
## # ... with 282 more rows
### CONSTRUCTING GRAPHS FROM THIS DATA ###
#This graph shows each person's times related to each other (i.e. ordered by their rank)

df %>% 
  ggplot(mapping=aes(x=Rank, y=Result)) + geom_point() + geom_smooth() + ggtitle("Scatterplot - Rank vs. Result (time)")

#we can see that this roughly follows a logarithmic curve, so I have fitted a smoothed curve (geom_smooth's default loess fit) along the graph
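# (As a quick sanity check of that logarithmic shape, here is a sketch that fits an explicit
# Result ~ log(Rank) model and overlays the fitted curve; logMod is a new object introduced just for this.)
logMod <- lm(Result ~ log(Rank), data = df)
df %>%
  ggplot(mapping = aes(x = Rank, y = Result)) +
  geom_point() +
  geom_line(aes(y = predict(logMod, newdata = df)), color = "red") +
  ggtitle("Scatterplot - Rank vs. Result with an explicit logarithmic fit")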

#This graph shows each time in relation to the year
df %>% 
  ggplot(mapping=aes(x=factor(Year), y=Result)) +  geom_boxplot() + geom_smooth(method=lm, mapping=aes(group=1)) + ggtitle("Boxplot - Result (Time) vs. Year") #group=1 lets the regression line run across all the boxes

#Judging by the regression line, this one follows a more linear trend
#Interestingly, as the years go on the median of the times doesn't change much and the 1st and 3rd quartiles vary a lot, but the outliers are what stick out. Now that we have two different ways of looking at this data, let's look at what each plot tells us.
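# (A small numeric companion to the boxplot, as a sketch: the per-year median, IQR, and number of solves
# back up the observations above.)
df %>%
  group_by(Year) %>%
  summarize(medianTime = median(Result), iqr = IQR(Result), n = n())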
##########################
### HYPOTHESIS TESTING ###
##########################

# Now that we have 2 charts, we have a dilemma: in general, does this data follow more of a linear trend like we saw with the boxplot, or does it have a more logarithmic curve to it as we saw in the scatterplot? Or maybe it follows both, depending on how we look at the data? This is the first question that we will look at in this part of the analysis. In order to identify how this data changed over time, we can view it using different criteria, starting with the mean over time


# The boxplot was helpful, but let's graph the mean of each year along with a regression line to see how the mean changes over time

df$Year <- as.numeric(as.integer(df$Year))
df %>% group_by(Year) %>%
  summarize(minTime = mean(Result)) %>%
  ggplot(mapping=aes(x=Year, y=minTime)) +
    geom_point()  + geom_smooth(method = lm) + ylab("Mean Time of Solves") + ggtitle("Mean Time of Solves per Year")

#now that we have this linear regression line, let's calculate its slope, or in other words how much the mean time decreases per year. To do this, we can fit a linear model and compute a confidence interval, as shown below

#grab the mean time for each year
mean_res <- df %>% group_by(Year)  %>% summarize(minTime = mean(Result))

#Linear Model
linearMod<- lm(minTime~Year, data=mean_res)

#Confidence Interval
confidenceInf <- linearMod %>%
  tidy() %>%
  select(term, estimate, std.error)

confidence_interval_offset <- 1.95 * confidenceInf$std.error[2] # multiplier for a rough 95% interval (the usual normal-approximation value is 1.96)
confidence_interval <- round(c(confidenceInf$estimate[2] - confidence_interval_offset,
                               confidenceInf$estimate[2],
                               confidenceInf$estimate[2] + confidence_interval_offset), 4)
confidence_interval
## [1] -0.0867 -0.0386  0.0096
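# (As a cross-check, base R's confint() gives a t-based 95% interval for the slope, which should be close
# to the hand-rolled normal-approximation interval above.)
confint(linearMod, "Year", level = 0.95)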
#Given these calculations, we can now see what this data is telling us: on average, the mean time that it takes to solve a cube decreases by about 0.04 seconds per year. This may not seem like a lot, but at the professional level, where solves are below 6.5 seconds, it is actually substantial: over this 6 year period the average time decreased by almost a quarter of a second.
# Another aspect of the boxplot we saw was that the outliers seemed to decrease every year. Let's take a look at the minimum from each year.

df %>% group_by(Year) %>%
  summarize(minTime = min(Result)) %>%
  ggplot(mapping=aes(x=Year, y=minTime)) +
     geom_text(aes(label = minTime)) + ylab("Fastest Solve") + ggtitle("Fastest Solve for each Year - Linear") +  geom_smooth(method=lm)

# In every year except 2014, the minimum goes down. In other words, the World Record single has dropped every year except 2014

#There are 2 interesting things I'd like to note about this graph. First, the 6.41 labeled in the 2014 column: this is the only value that doesn't fit the trend of the World Record (WR) decreasing every year. It's an interesting anomaly, but all we can say is that 2014 just wasn't a lucky year. A lot of these best singles involve some amount of luck, as I can attest from experience.

##################
### PREDICTION ###
##################

#Secondly, I want to draw attention to the 4.22 in the 2018 spot. This record happened on May 6, 2018, and was set by an Australian named Feliks Zemdegs. Before this, the best time achieved in 2018 was 4.59, also by Feliks, tying the World Record set by SeungBeom Cho, who got his 4.59 in 2017. Cutting the World Record down by 0.37 seconds is a really impressive feat by Feliks. But by looking at the best times for each year, could we have predicted this?
#Let's revisit the fastest-solve-per-year data, but this time hold out the 2018 data point and fit a linear model to the remaining years, so we can see what that model would have predicted for 2018.
#We can replicate the code we used above

min_mean <- df %>% group_by(Year) %>% filter(Year != "2018") %>% summarize(minTime = min(Result))

#Linear Model
linearMod<- lm(minTime~Year, data=min_mean)

#Confidence Interval
confidenceInf <- linearMod %>%
  tidy() %>%
  select(term, estimate, std.error)

confidence_interval_offset <- 1.95 * confidenceInf$std.error[2]
confidence_interval <- round(c(confidenceInf$estimate[2] - confidence_interval_offset,
                               confidenceInf$estimate[2],
                               confidenceInf$estimate[2] + confidence_interval_offset), 4)
confidence_interval
## [1] -0.5761 -0.3620 -0.1479
# here, we can see that this model predicts that the WR goes down by ~0.36 seconds each year. Doing the math, we get 
print(4.59 - 0.36) # = 4.23, just one hundredth of a second away from what actually occurred! So, for this dataset, the linear model's prediction is actually pretty accurate. 
## [1] 4.23
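# (Equivalently, we could ask the fitted model, which excluded 2018, for its 2018 prediction directly;
# the exact value depends on the fit, so this is just a sketch.)
predict(linearMod, newdata = data.frame(Year = 2018))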
#HOWEVER, and this is the important part, the best time can't keep decreasing at a steady rate every year as the linear model assumes. It is unlikely that anyone will get an official solve below 4 seconds next year, as this method suggests. If we had more data from previous years, I suspect the curve would look more logarithmic. And by looking at the graph with a logarithmic line drawn on it (translated along the axes to fit the data), we can see that this makes sense:

df %>% group_by(Year) %>%
  summarize(minTime = min(Result)) %>%
  ggplot(mapping=aes(x=Year, y=minTime)) +
     geom_text(aes(label = minTime)) + ylab("Fastest Solve") + ggtitle("Fastest Solve for each Year - Logarithmic")  + stat_function(fun=function(x) -log(x-2011) +6.3)

#This can partially be seen in the first graph we made plotting rank vs. time with a logarithmic-looking regression line: the curve doesn't really fit the few fastest times, and Feliks' 4.22 is pretty far off from it, suggesting that it is not very predictable. So while the data may look linear, a combination of luck and not having enough data has fooled us into thinking so; if we step back from the problem for a minute and think about what is actually happening, we can see that this is NOT a linear trend. 

#Over the next few years this linear trend will almost certainly plateau into a logarithmic curve, but as I said above, that data just doesn't exist yet. Still, I recommend the reader look back in a few years and see whether this turns out to be the case!


### HYPOTHESIS CONCLUSION ###

#As stated above, we can see that the trend of how times change over the years is logarithmic all around, not linear (or both) as some of the data suggested. And also as stated above, this makes sense because solve times can't keep decreasing linearly (they would hit 0 or go negative, both of which are impossible); the decrease has to level off. This is a great example of knowing what kind of analysis to do, and of using common sense alongside your predictions, in case your hypothesis is wrong but an outlier seems to support it, which is exactly what happened here.

#Lastly, I'd like to make a small note that Feliks' 4.22 really is amazing. Given the logarithmic regression, it is well below what is predicted, and it's really cool that he was able to achieve such a fast time. 
### ANALYSIS OF COUNTRIES IN TOP 300 DATA ###

#The next thing I wanted to look at was the spread of countries represented in this dataset. So, I went through each country and counted how many competitors were from each one. Then I used the rworldmap library to make a heatmap showing how many competitors came from each country

library(rworldmap)
country2 <- unique(df$`Citizen of`) #make the vector listing all the countries, getting rid of duplicates

#store the number of competitors from each country
value2 <- integer()
for(c in country2){
  value2 <- c(value2, sum(df$`Citizen of` == c)) #count exact matches in the 'Citizen of' column
}

#make the data frame holding these values
d <- data.frame(
  country=country2,
  value=value2)
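# (The same tallies can be had in one line with dplyr's count(); shown as a sketch for comparison.)
df %>% count(`Citizen of`, sort = TRUE)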

#make the heatmap
n <- joinCountryData2Map(d, joinCode="NAME", nameJoinColumn="country")
## 46 codes from your data successfully matched countries in the map
## 2 codes from your data failed to match with a country code in the map
## 197 codes from the map weren't represented in your data
mapCountryData(n, nameColumnToPlot="value", mapTitle="Number of Competitors per Country",oceanCol="lightblue", missingCountryCol="grey")

# As we can see from this data, the largest share of competitors is from the U.S. In fact, 86 of the 292 solves in our table (drawn from the top 300) were achieved by U.S. speedcubers, roughly 1 in 3.4. This is pretty cool, because it shows that the U.S. is very prominent in the cubing world.

#Since the U.S. skews this heatmap so much, I decided to see what it would look like with the U.S. excluded from the map.
### HEATMAP ANALYSIS EXCLUDING OUTLIER ###
d <- d[d$country != "United States",] #remove the U.S. from the map

#make the heatmap
n <- joinCountryData2Map(d, joinCode="NAME", nameJoinColumn="country")
## 45 codes from your data successfully matched countries in the map
## 2 codes from your data failed to match with a country code in the map
## 198 codes from the map weren't represented in your data
mapCountryData(n, nameColumnToPlot="value", mapTitle="Number of Competitors per Country Excluding U.S.",oceanCol="lightblue", missingCountryCol="grey")

#Now we can see the yellow-orange-red split much more clearly. Countries such as China, Australia, and Russia (and the U.S.) are well represented in the sample, Scandinavia falls in the middle, and places such as Mexico and Algeria have only 1 or 2 competitors among the top 300 fastest solves. 
### MAKE US AND "EVERYONE ELSE" TABLES ###
#Originally, I wanted to compare the data between women and men and see which gender had better statistics, but in the top 300 there were only 6 women competitors I could find. So I instead decided to look at how the U.S. stacks up against the rest of the world in terms of speedcubing ability.

### MAKE THE US TABLE ###
US <-  character()
for(curr in 1:nrow(df)){ # find every row that has a cuber from the U.S.
  if(df[curr,]$`Citizen of` == "United States"){
    US <- c(US,curr)
  }
}

US_df <- df[US,] #copy all those rows into a new data frame
as.tibble(US_df)
## # A tibble: 86 x 6
##     Rank Person         Result `Citizen of`  Competition              Year
##  * <int> <chr>           <dbl> <chr>         <chr>                   <dbl>
##  1     3 Patrick Ponce    4.69 United States Rally In The Valley 20~  2017
##  2     6 "  Drew Brads"   4.76 United States Bluegrass Spring 2017    2017
##  3     7 Max Park         4.78 United States Skillcon 2017            2017
##  4     8 Blake Thompson   4.86 United States Queen City 2017          2017
##  5    10 Lucas Etter      4.90 United States River Hill Fall 2015     2015
##  6    15 Keaton Ellis     5.08 United States Slow N Steady Summer 2~  2017
##  7    18 Rami Sbahi       5.22 United States Shaker Fall 2016         2016
##  8    20 Collin Burns     5.25 United States Doylestown Spring 2015   2015
##  9    22 Dana Yi          5.37 United States Slow N Steady Summer 2~  2017
## 10    28 "  Jeff Park"    5.52 United States Maryland 2017            2017
## # ... with 76 more rows
### MAKE THE NON-US TABLE ###
US <- paste("-",US,sep="") # add a minus to each item in the string
US <- as.integer(as.character(US))

NON_US_df <- df[US,] # remove each US item
as.tibble(NON_US_df)
## # A tibble: 206 x 6
##     Rank Person               Result `Citizen of`      Competition    Year
##  * <int> <chr>                 <dbl> <chr>             <chr>         <dbl>
##  1     1 Feliks Zemdegs         4.22 Australia         Cube for Cam~  2018
##  2     2 "SeungBeom Cho "       4.59 Republic of Korea ChicaGhosts ~  2017
##  3     4 Mats Valk              4.74 Netherlands       Jawa Timur O~  2016
##  4     5 Bill Wang              4.76 Canada            Pickering Sp~  2018
##  5     9 Antonie Paterakis      4.89 Greece            The Hague Op~  2017
##  6    11 "  Seung Hyuk Nahm "   4.90 Republic of Korea China Champi~  2017
##  7    12 "Hyo-Min Seo "         4.94 Republic of Korea Korean Champ~  2016
##  8    13 "  Kevin Gerhardt"     4.94 Germany           German Natio~  2017
##  9    14 Philipp Weyer          5.05 Germany           German Natio~  2017
## 10    16 Alexandre Carlier      5.19 France            France 2018    2018
## # ... with 196 more rows
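# (Aside: the same U.S. / non-U.S. split can be done with a single logical index instead of building the
# index vector in a loop. The *_alt names are new and only here for comparison.)
us_rows <- df$`Citizen of` == "United States"
US_df_alt <- df[us_rows, ]      # same rows as US_df above
NON_US_df_alt <- df[!us_rows, ] # same rows as NON_US_df above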
### COMPARE AND CONTRAST US AND REST OF WORLD ###

#Now that we have our 2 tables for the U.S. and the rest of the world, we can compare them. Let's start by looking at each of their averages over the years

p1 <- US_df%>% group_by(Year) %>%
  summarize(minTime = mean(Result)) %>%
  ggplot(mapping=aes(x=Year, y=minTime)) +
  geom_point()  + geom_smooth() + ylab("Mean Time of Solves") + ggtitle("Mean Time / Year - U.S.")

p2 <- NON_US_df%>% group_by(Year) %>%
  summarize(minTime = mean(Result)) %>%
  ggplot(mapping=aes(x=Year, y=minTime)) +
  geom_point()  + geom_smooth() + ylab("Mean Time of Solves") + ggtitle("Mean Time / Year - All but U.S.")

plot_grid(p1,p2)

# These plots show that in the U.S. the averages are actually pretty varied: there isn't really a trend from year to year, and you can't really predict anything. For the rest of the world (and for the entire world including the U.S., as in an earlier plot) we see a distinct downward logarithmic trend. So while the U.S. may have done better than the rest of the world in 2012-2015, the past couple of years seem to be in flux. Based on these graphs, the rest of the world is starting to catch up to the U.S. In the future I'm sure we will start seeing less U.S. representation in this top 300 list and more variation from other countries' cubers.
#And just as one quick aside, we can also look at which country set the World Record every year:

best <- character() # vector to store which country set the fastest time each year
#flags to mark if that country has been seen yet or not
flag2012 <- 0 
flag2013 <- 0
flag2014 <- 0
flag2015 <- 0
flag2016 <- 0
flag2017 <- 0
flag2018 <- 0

for(curr in 1:nrow(df)){ # walk the rows in rank order and record the first (i.e. fastest) solve seen for each year
    if(df[curr,]$Year == 2012 && flag2012 == 0){
      flag2012 <- 1
      best <- c(best,paste("2012",df[curr,]$`Citizen of`,sep="-"))
    } else if(df[curr,]$Year == 2013 && flag2013 == 0){
      flag2013 <- 1
      best <- c(best,paste("2013",df[curr,]$`Citizen of`,sep="-"))
    # 2014 is not included because the WR was not broken that year  
    #} else if(df[curr,]$Year == 2014 && flag2014 == 0){
    #  flag2014 <- 1
    #  best <- c(best,paste("2014",df[curr,]$`Citizen of`,sep="-"))
    } else if(df[curr,]$Year == 2015 && flag2015 == 0){
      flag2015 <- 1
      best <- c(best,paste("2015",df[curr,]$`Citizen of`,sep="-"))
    } else if(df[curr,]$Year == 2016 && flag2016 == 0){
      flag2016 <- 1
      best <- c(best,paste("2016",df[curr,]$`Citizen of`,sep="-"))
    } else if(df[curr,]$Year == 2017 && flag2017 == 0){
      flag2017 <- 1
      best <- c(best,paste("2017",df[curr,]$`Citizen of`,sep="-"))
    } else if(df[curr,]$Year == 2018 && flag2018 == 0){
      flag2018 <- 1
      best <- c(best,paste("2018",df[curr,]$`Citizen of`,sep="-"))
    }
}

best
## [1] "2018-Australia"         "2017-Republic of Korea"
## [3] "2016-Netherlands"       "2015-United States"    
## [5] "2013-United Kingdom"    "2012-Japan"
# 2012 - Japan
# 2013 - United Kingdom
# 2014 - Not Broken
# 2015 - United States
# 2016 - Netherlands
# 2017 - Republic Of Korea
# 2018 - Australia (current)
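# (The same "fastest solve per year" lookup can be written with dplyr, as a sketch; note that unlike the
# loop above it also reports 2014, even though the record was not broken that year, and it keeps ties if any.)
df %>%
  group_by(Year) %>%
  filter(Result == min(Result)) %>%
  select(Year, `Citizen of`, Result) %>%
  arrange(Year)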

#As we can see, the U.S. has set the best solve of the year only once in the past 6 years. So by that statistic (and the fastest single is usually more impressive to people than the average), the U.S. really isn't doing that well. But that's one of the great things about speedcubing: it is a worldwide activity that anyone can compete in and become good at, and even cooler is that you get to meet people all over the world who share your hobby.
##################
### CONCLUSION ###
##################

# So, to conclude: we looked at how to parse data from a website and how to tidy and fix up that data. Then we examined a few hypotheses about how the times are spread within this top 300 speedcubing dataset, and after testing found that the trend is logarithmic. Next, we mapped the competitors by country of origin and found that the U.S. is predominant in this data. However, the U.S. has not done as well as its share of the data might suggest at setting the best single solve each year, and in fact the rest of the world seems to be catching up. Hopefully this gave you a good idea of how the data science pipeline works, and a feel for what speedcubing and speedcubing trends are like. There are still many more questions that could be answered, and as more data comes in I'd like to see what else I can find and whether some of my predictions come true. I hope this was informative! 

#If you want to see all of the data for every speedcuber and speedcubing competition, the World Cube Association has a fantastic website you can check out: https://www.worldcubeassociation.org/