A real-world example

We're going to take a look at data on COVID coverage by local news agencies in the US.

There are a number of features in this dataset; a more detailed description of most of them is available here. I'll describe what we're using here, though, obviously.

Note: I am, again, using data that I have used for research in the past. I think this is useful because I am able to provide more thoughtful responses to questions and comments about the data and methods, not because I think my research is especially great or because you should read it.

Walking the ML Pipeline

Real world goal

Find local news deserts. In other words, find places where people aren't likely to be getting adequate news about COVID.

Real world mechanism

Identify places where the number of articles that cover COVID is low. We can then try to forecast coverage in locations where we don't have data.

Learning problem

Predict the percentage of weekly coverage devoted to COVID for local news outlets across the country that make data available

Data Collection

Data Representation

(Modified) Pipeline - Let's do some EDA!
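Since we haven't settled on specific columns yet, here is only a hedged sketch of the kind of first look we'll take in class. The file name and column names below are placeholders, not the actual dataset fields.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file name -- swap in the actual data we use in class
df = pd.read_csv("covid_local_news.csv")

# Basic shape, types, and missingness
print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head(10))

# Distribution of the outcome we care about: percent of weekly coverage on COVID
# ("pct_covid_coverage" is a placeholder column name)
df["pct_covid_coverage"].hist(bins=50)
plt.xlabel("Percent of weekly coverage devoted to COVID")
plt.ylabel("Number of outlet-weeks")
plt.show()
```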

Target Class/Model

Linear regression. Now we know what that is! And how to optimize it (although we'll use the sklearn implementation).
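As a quick reminder, the ordinary least squares objective we've already seen (and which sklearn's implementation solves for us) is:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^\top \beta\right)^2$$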

Training dataset

Picking a "good" training dataset ... getting us started

Next week, we'll cover model evaluation in more detail. For now, we're just going to note that to ensure our model is generalizable and not overfit to the training data, we need to separate out a training dataset from a test dataset.
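As a minimal sketch (assuming the placeholder DataFrame df from the EDA step above), sklearn's train_test_split gives us a simple random split:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for evaluation; fixing the random seed makes
# the split reproducible. df is the placeholder DataFrame from the EDA step.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

print(len(train_df), len(test_df))
```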

Exercise: Give a high-level argument for why evaluating on the training data is a bad idea

With temporal data, and more generally data with dependencies, it is also important to make sure we avoid leaking information between the training and test data in a way that gives us a biased picture of how well we are making predictions. Leakage can happen in at least two ways (again, we'll go into more detail next week): (1) temporal leakage, where we train on observations from the future to predict the past, and (2) dependency leakage, where closely related observations (e.g., weeks from the same outlet) end up split across the training and test sets.
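Just for intuition, here is roughly what guarding against the temporal form of leakage could look like: split on time rather than at random, so that everything in the test set comes after everything in the training set. The "week" column name is a placeholder.

```python
# Train on earlier weeks, test on later weeks, so we never use the future
# to predict the past. "week" is a placeholder column name.
df_sorted = df.sort_values("week")
cutoff_idx = int(len(df_sorted) * 0.8)
time_train_df = df_sorted.iloc[:cutoff_idx]
time_test_df = df_sorted.iloc[cutoff_idx:]
```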

For this simple example, we're going to ignore the leakage issue for now. We'll come back and fix it next week.

Picking ... a dataset

Neat! But to pick a good training dataset, we first need to know ... what our dataset is. This data has a lot of features. In class, we'll play with a bunch of them together. Here, I'm just going to get us started.
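To make the rest of the walkthrough concrete, here is a hedged sketch of pulling out an outcome and a handful of predictors. All of the column names are placeholders for whatever features we settle on in class.

```python
# Placeholder outcome and feature names -- swap in the real column names
outcome = "pct_covid_coverage"
features = ["total_articles", "population", "median_income"]

# Drop rows with missing values in the columns we care about, then separate
# the predictors (X) from the outcome (y)
model_df = df[[outcome] + features].dropna()
X = model_df[features]
y = model_df[outcome]
```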

Model training

OK, let's have at it!
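A minimal sketch of the fit, using the placeholder X and y from above and a plain random split (the leakage caveats above still apply; we'll fix them next week):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Random split for now -- see the leakage discussion above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)
```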

Predict on Test Data
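Continuing the sketch, prediction is a single call on the held-out features:

```python
# Predictions for the held-out outlet-weeks
y_pred = model.predict(X_test)
print(y_pred[:5])
```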

Evaluate error
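A few standard regression error metrics, again assuming the y_test and y_pred from the sketch above:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```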

Deploy (?)

What might we be asking ourselves before we deploy? What might we try to change? Let's work on it!

In-class exercise ... beat Kenny's predictive model!

For some reason not in the pipeline ... evaluating coefficients ...

We have to be really careful when interpreting coefficients for models with transformed predictor variables. Here, for example, is a useful resource for your programming assignment.
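As one common case (assuming a natural-log-transformed predictor, which may or may not match what you use in the assignment), for a model

$$y = \beta_0 + \beta_1 \log(x) + \varepsilon,$$

a 1% increase in $x$ changes the prediction by

$$\beta_1\log(1.01x) - \beta_1\log(x) = \beta_1\log(1.01) \approx 0.01\,\beta_1,$$

so $\beta_1/100$ is roughly the change in $y$ associated with a 1% increase in $x$.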

In our case, we actually ended up using a variable that means our coefficients have to be interpreted the way coefficients are interpreted in logistic regression. Here is a good explanation. We will cover this in more detail next week, but there's a simple plot below to discuss!
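Here is a minimal sketch of the kind of coefficient plot I have in mind, assuming the fitted model and the placeholder features list from the sketches above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Bar plot of the fitted coefficients, sorted for readability
coefs = pd.Series(model.coef_, index=features).sort_values()
coefs.plot(kind="barh")
plt.xlabel("Estimated coefficient")
plt.tight_layout()
plt.show()
```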

Exercise: Are the estimates from the coefficients we used comparable? Which are, and which are not? What might we do to make them even more comparable?

For this demo, I took code from this sklearn tutorial. The tutorial is very nice, and I would highly recommend going through it, although I will teach most of what is in it over the next week or two in one way or another.