```r
library(tidyverse)

NBA     = read_csv("https://sta323-sp23.github.io/data/NBA_1516.csv")
team_id = read_csv("https://sta323-sp23.github.io/data/teams.csv")
```
Lab 6: ascend the likelihood mountain
Due Thursday March 30 5:00pm
“There are two kinds of climbers: those who climb because their heart sings when they’re in the mountains, and all the rest.” - Alex Lowe
By the end of the lab, you will
- implement an MM algorithm to rank sports teams
- construct an MM algorithm from scratch for text document classification
Getting started
In the terminal tab, type

```
cd ~
```

to navigate to your home directory. Next, run

```
git clone git@github.com:sta323-sp23/lab-6-team_name.git
```

where `team_name` is replaced with your team name (see the Excel signup sheet in Box or your GitHub). Navigate to your `lab-6` folder and open the `.Rproj` file. Open the Quarto (`.qmd`) file and change the author name to your team name followed by a colon and then the names of the team members.
Exercises
For all exercises, you should respond in the space provided in the template `lab-6.qmd` and show all your work. In all answers of type `double`, three significant figures suffice.
1. rank NBA teams
For this question, we will use the data set `NBA`, which contains all 1230 regular-season NBA games from the 2015-2016 season and their outcomes.

```r
glimpse(NBA)
```
Code book:

- `Home`: unique id for home team
- `Away`: unique id for away team
- `Y`: whether the home team won (1) or lost (0)
The `team_id` data set contains a dictionary to map between the unique ids from `NBA` and the actual team names.
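Mapping ids to names is a join. A minimal self-contained sketch with toy data; the column names in `team_id` below (`id`, `name`) are assumptions — check the real ones with `glimpse(team_id)`:

```r
# Attach team names to home-team ids via a left join. The team_id
# columns ("id", "name") and the toy rows are illustrative assumptions.
library(dplyr)

team_id = tibble(id = c(1, 2), name = c("Atlanta Hawks", "Boston Celtics"))
games   = tibble(Home = c(1, 2), Away = c(2, 1), Y = c(1, 0))

home_games = games %>%
  left_join(team_id, by = c("Home" = "id")) %>%
  rename(home_team = name)
```

The same pattern with a second join on `Away` attaches the away-team names.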
Assuming the outcome of each game is independent, we can rank teams and jointly model a home-court advantage in the 2015-2016 season using the Bradley-Terry model described in class.
- Implement the MM algorithm, as described on this slide, to estimate \(\hat{\mathbf{p}}_{MLE}\) and \(\hat{\theta}_{MLE}\). Your implementation should be adaptable to another data set of identical construction (i.e. don't hard-code data values into your functions). Start your algorithm with \(\theta_0 = 0.5\) and \(\mathbf{p} = \mathbf{1}\), where \(\mathbf{1}\) is a vector of 30 1s. Run the MM updates for 100 steps (i.e. update each element of \(\mathbf{p}\) and \(\theta\) 100 times).
- What are the ten highest ranked teams (in order) from the 2015-2016 season according to the Bradley-Terry model? Report your estimates \(\hat{\mathbf{p}}_{MLE}\) together with the team name and id number for the top ten teams in a table.
- Is there a home-court advantage? What are the odds of winning at home vs away? What are the log-odds?
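The algorithm above can be sketched in R. This sketch assumes the home-advantage Bradley-Terry parameterization \(P(\text{home } i \text{ beats away } j) = \theta p_i / (\theta p_i + p_j)\) and the corresponding MM updates from Hunter (2004); the function name `bt_mm` is made up, and if the updates on the course slide differ, follow those instead:

```r
# MM algorithm for Bradley-Terry with home advantage (a sketch; the
# update forms follow Hunter 2004 and may differ from the course slide).
# games: data frame with columns Home, Away (integer ids), Y (1 = home win)
bt_mm = function(games, n_teams, n_iter = 100, theta0 = 0.5) {
  p     = rep(1, n_teams)    # team strengths, all 1 at start
  theta = theta0             # home-advantage parameter
  for (iter in seq_len(n_iter)) {
    wins  = numeric(n_teams) # wins credited to each team
    denom = numeric(n_teams) # MM denominator for each p_i
    theta_num = 0
    theta_den = 0
    for (g in seq_len(nrow(games))) {
      h = games$Home[g]; a = games$Away[g]; y = games$Y[g]
      d = theta * p[h] + p[a]
      wins[h]  = wins[h] + y
      wins[a]  = wins[a] + (1 - y)
      denom[h] = denom[h] + theta / d
      denom[a] = denom[a] + 1 / d
      theta_num = theta_num + y        # total home wins
      theta_den = theta_den + p[h] / d
    }
    p     = wins / denom
    theta = theta_num / theta_den
  }
  list(p = p, theta = theta)
}
```

Under this parameterization, two equally rated teams give \(P(\text{home win}) = \theta/(\theta + 1)\), so \(\theta\) can be read as the odds of winning at home and \(\log\theta\) as the log-odds.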
optional extra credit
- Fix \(p_1 = 1\) in your algorithm. Construct 100 bootstrap data sets (re-sample from the original `NBA` data set with replacement) and re-compute \(\hat{\mathbf{p}}_{MLE}\) and \(\hat{\theta}_{MLE}\) for each. Here we choose 100 bootstrap replicates to keep the computational demand low.
Report the maximum likelihood estimates (based on the actual, original data set) together with a 95% confidence interval associated with each parameter in a 31 row table. Your table should label each row (except the row associated with \(\theta\)) by the team’s actual name (not just the ID).
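The resample-and-refit loop itself is mechanical. A minimal self-contained sketch of the pattern; `fit_bt` here is a placeholder (win proportions and overall home-win rate), not your actual MM fit — substitute your own fitting function and the real `NBA` data:

```r
# Bootstrap sketch. fit_bt is a PLACEHOLDER, not the Bradley-Terry MM
# fit; toy data keeps the example self-contained and runnable.
fit_bt = function(games, n_teams) {
  winner = ifelse(games$Y == 1, games$Home, games$Away)
  wins   = tabulate(winner, n_teams)
  played = tabulate(games$Home, n_teams) + tabulate(games$Away, n_teams)
  list(p = wins / pmax(played, 1), theta = mean(games$Y))
}

games = data.frame(Home = c(1, 2, 3, 1), Away = c(2, 3, 1, 3),
                   Y    = c(1, 0, 1, 1))

set.seed(323)
boot = replicate(100, {
  idx = sample(nrow(games), replace = TRUE)   # re-sample rows
  est = fit_bt(games[idx, ], n_teams = 3)
  c(est$p, theta = est$theta)
})

# 95% percentile interval for each parameter (one column per parameter)
ci = apply(boot, 1, quantile, probs = c(0.025, 0.975))
```

With the real data this produces the 30 + 1 intervals for the table; `t(ci)` gives one row per parameter.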
2. text classifier
In this exercise we will work towards re-creating the classifier of Ogura, Amano and Kondo (2013) by using a gamma-Poisson distribution to model the frequency of words in a document. Read section 3.1 of the article for relevant background about this model.
The model
Let \(N_d\) be the total number of documents and let \(N_v\) be the total number of unique words in all documents under study. In our model, we assert that the number of occurrences \(X_{ij}\) of word \(j\) in document \(i\) is Poisson with mean \(\lambda_j w_i\), where \(w_i\) is the number of words in document \(i\). Furthermore, we assert \(\lambda_j \sim \text{gamma}(\alpha_{cj}, \beta_{cj})\), where \(\alpha_{cj}\) and \(\beta_{cj}\) are parameters specific to word \(j\) in a document of class \(c\).
Assuming each document is independent, we write the likelihood
\[ L(\mathbf{\alpha}, \beta) = \prod_{i = 1}^{N_d} \prod_{j = 1}^{N_{v}} \frac{\beta_{cj}^{\alpha_{cj}} \Gamma(x_{ij} + \alpha_{cj})w_i^{x_{ij}}}{ \Gamma(\alpha_{cj}) (w_i + \beta_{cj})^{x_{ij} + \alpha_{cj}} \cdot x_{ij}! }. \]
Note that each document \(i \in \{1, \ldots, N_d \}\) has one of five class labels (e.g. `comp.sci`) and so \(c \in \{1, \ldots, 5 \}\). Effectively, this is the same as fitting five different models of the following form to five sets of data (one data set for each class):
\[ L(\mathbf{\alpha}, \beta) = \prod_{i = 1}^{N_d} \prod_{j = 1}^{N_{v}} \frac{\beta_{j}^{\alpha_{j}} \Gamma(x_{ij} + \alpha_{j})w_i^{x_{ij}}}{ \Gamma(\alpha_{j}) (w_i + \beta_{j})^{x_{ij} + \alpha_{j}} \cdot x_{ij}! }. \]
Note that \(\alpha\) and \(\beta\) are vectors. Since we wish to compute the maximum likelihood estimates \(\hat{\alpha}_{MLE}\) and \(\hat{\beta}_{MLE}\), we need to find a function \(g\) that minorizes the log-likelihood \(\log L(\alpha, \beta)\).
part 1
- To begin, expand the \(\Gamma\) function ratio using the property \(\Gamma(x+1) = x\,\Gamma(x)\) (equivalently, \(\Gamma(x) = (x-1)!\) for positive integers), so that \(\Gamma(x_{ij} + \alpha_j)/\Gamma(\alpha_j) = \prod_{k=0}^{x_{ij}-1} (\alpha_j + k)\). Next, write the log-likelihood.
part 2
- Use Jensen’s inequality and the supporting line minorization to construct \(g\). Write down \(g(\alpha, \beta | \alpha_n, \beta_n)\).
part 3
Show that the MM update of \(\beta_j\) is
\[ \beta_{n+1, j} = \frac{N_d \cdot \alpha_{nj}}{ \sum_{i=1}^{N_d} \frac{x_{ij} + \alpha_{nj}}{w_i + \beta_{nj}} } \]
and that the MM update of \(\alpha_j\) is
\[ \alpha_{n+1, j} = \frac{\sum_{i=1}^{N_d} \sum_{k = 0}^{x_{ij} - 1} \frac{\alpha_{nj}}{\alpha_{nj} + k}}{ \sum_{i = 1}^{N_d} \log \left(\frac{w_i + \beta_{nj}}{\beta_{nj}}\right) } \]
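Once derived, the two updates translate into R almost directly. A minimal sketch for one class, assuming the counts sit in an \(N_d \times N_v\) matrix `X` with document lengths `w`; the function name `gp_mm`, the starting values, and the simultaneous (rather than cyclic) update order are assumptions, and the loops favor clarity over speed:

```r
# MM updates for the gamma-Poisson model, one class at a time (sketch).
# X: N_d x N_v matrix of word counts; w: document lengths (rowSums(X)).
gp_mm = function(X, w, n_iter = 100) {
  Nd = nrow(X); Nv = ncol(X)
  alpha = rep(1, Nv)
  beta  = rep(1, Nv)
  for (it in seq_len(n_iter)) {
    alpha_new = numeric(Nv)
    beta_new  = numeric(Nv)
    for (j in seq_len(Nv)) {
      # numerator of alpha update: sum_i sum_{k=0}^{x_ij - 1} a/(a + k)
      num = sum(vapply(X[, j], function(x)
        if (x > 0) sum(alpha[j] / (alpha[j] + 0:(x - 1))) else 0,
        numeric(1)))
      # denominator: sum_i log((w_i + beta_j)/beta_j)
      den = sum(log((w + beta[j]) / beta[j]))
      alpha_new[j] = num / den
      # beta update: N_d * alpha_j / sum_i (x_ij + alpha_j)/(w_i + beta_j)
      beta_new[j] = Nd * alpha[j] / sum((X[, j] + alpha[j]) / (w + beta[j]))
    }
    alpha = alpha_new
    beta  = beta_new
  }
  list(alpha = alpha, beta = beta)
}
```

If your derivation yields a cyclic scheme (refresh \(\alpha\) before updating \(\beta\), or vice versa), use that ordering instead; both are valid MM variants.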
Style guidelines
All assignments in this course must employ proper coding style, as outlined below:
- All code should obey the 80 character limit per line (i.e. no code should run off the page when rendering or require scrolling). To enable a vertical line in the RStudio IDE that helps guide this, see the style guidelines from lab 0 or ask a member of the teaching team for help.
- All commas should be followed by a space.
- All binary operators should be surrounded by spaces. For example, `x + y` is appropriate; `x+y` is not.
- All pipes (`%>%` or `|>`) as well as ggplot layers (`+`) should be followed by a new line.
- You should be consistent with stylistic choices, e.g. only use one of `=` vs `<-` and `%>%` vs `|>`.
- Your name should be at the top (in the YAML) of each document under "author:".
- All code chunks should be named, with names that don't have spaces, e.g. `ex-1`, `ex-2`, etc.
- File names in your GitHub repo, such as `lab-x.qmd`, must not be changed and must be left as provided.

Additionally, your repo must pass certain basic checks. The results of these checks are visible on GitHub via the badges at the top of your README and the actions tab. These are meant to give you feedback on the structure and reproducibility of your repository and assignment - they do not assess the correctness of your work. You should consider them a necessary but not sufficient condition when turning in your work - passing all of the checks simply means you have met a minimum standard of reproducibility for the assignment.

Fundamentally, the check is making sure 1) you only have the files you should in your repository, and 2) your `.qmd` renders.
If you have any questions about style, please ask a member of the teaching team.
Submitting your lab
To submit your assignment, simply commit and push your completed `lab-x.qmd` to your GitHub repo. Your most recent commit within 48 hours after the assignment deadline will be graded, and any applicable late penalty will be applied (see the syllabus). For this reason, do not push commits after you are satisfied with your work, or a late penalty may be applied.