DS from S: Chapter 11
Machine Learning
Some family things came up and it’s taken me a while to do this chapter. Should be able to get the next few chapters out sooner rather than later, though.
Thought before starting
Looking forward to this chapter. I’m thinking (once again) that this will be the first chapter where the content is genuinely new (rather than just some new tooling).
Thoughts while reading
Initially pretty conceptual: “what is machine learning?” kind of stuff (it’s even the title of a subsection).
Overfitting & underfitting are pretty familiar concepts for me from physics.
Getting some actual vocabulary I plan to use here; creating a Vocab section.
Definitely learning some new things in this chapter, though most of it is conceptual. Looks like we’ll be building on this soon!
What I learned
Shallow copy a list with
data2 = data[:]! I used to always use some version of data2 = copy.copy(data) from the copy library
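A quick sketch (my own, not from the book) showing that the slice makes a shallow copy just like copy.copy:

```python
import copy

data = [1, 2, 3]

slice_copy = data[:]          # shallow copy via slicing
copy_copy = copy.copy(data)   # shallow copy via the copy library

slice_copy.append(4)          # mutating the copy...
print(data)                   # ...leaves the original untouched: [1, 2, 3]
print(slice_copy)             # [1, 2, 3, 4]
print(copy_copy == data)      # True
```

Both are shallow: for a list of mutable objects, the copies share the same elements.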
Lots of vocabulary terms
What I liked
Discussion of what can go wrong when splitting training & test data, including the author’s own encounter with one such problem
What I disliked
Other thoughts
Vocabulary terms
Predictive modeling
Data mining
Machine learning
Supervised learning: data is labeled with the correct answers to learn from
Unsupervised learning: data has no annotations; can’t predict the “correct” answers but you might find correlations
Online learning: model continually adjusts itself based on new information
Reinforcement learning: the model gets a feedback signal saying how well its predictions performed
Training data set: to train the model
Validation data set: to choose quality from among candidate models
May not be necessary if you’re not choosing from among candidates
Testing data set: used to judge the quality of the final model
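A minimal sketch of how the three-way split might look, assuming a simple shuffle-and-cut helper (my own illustration, not the book’s exact API):

```python
import random

def split_data(data, prob):
    """Split data into fractions [prob, 1 - prob] after shuffling."""
    data = data[:]                  # shallow copy so we don't reorder the caller's list
    random.shuffle(data)
    cut = int(len(data) * prob)
    return data[:cut], data[cut:]

random.seed(0)
points = list(range(100))
train, rest = split_data(points, 0.7)      # 70% for training
validation, test = split_data(rest, 0.5)   # split the remainder evenly
print(len(train), len(validation), len(test))  # 70 15 15
```

If you aren’t choosing among candidate models, you’d skip the second split and keep the remainder as a single test set.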
Correctness terms:
True positive: predicted positive & actual positive
False positive (Type 1 error): predicted positive & actual negative
False negative (Type 2 error): predicted negative & actual positive
True negative: predicted negative & actual negative
Precision: how accurate the positive predictions are
Recall: what fraction of the actual positives our model caught
F1 Score: Harmonic mean of precision and recall
Note: the harmonic mean is the inverse of the mean of the inverses; e.g., your average speed over two trips of equal length is the harmonic mean of the two speeds.
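The correctness definitions above can be computed directly from confusion-matrix counts; the tp/fp/fn numbers below are hypothetical:

```python
def precision(tp, fp):
    # Of everything we predicted positive, what fraction really was?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything actually positive, what fraction did we catch?
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

tp, fp, fn = 70, 30, 20            # hypothetical counts
print(precision(tp, fp))           # 0.7
print(recall(tp, fn))              # 70/90 ≈ 0.778
print(f1_score(tp, fp, fn))        # ≈ 0.737
```

Because it’s a harmonic mean, F1 is dragged down by whichever of precision or recall is worse.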
Bias: error from bad assumptions in the model; you miss the actual connections between features & target outputs
High values can come from underfitting the data, or from other sources
Variance: error from sensitivity to minor variations in the data
High values can come from overfitting the data, or from other sources
Bias-variance tradeoff
High bias & low variance: underfit
Low bias & high variance: overfit
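A toy illustration of the tradeoff (my own sketch, not from the book): fit polynomials of increasing degree to noisy samples of sin(2πx), then compare training error with error against the noise-free curve. A degree-0 fit underfits (high bias, both errors high); a degree-9 fit drives training error down but chases the noise (high variance):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)  # noisy training data

x_true = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_true)                           # noise-free target curve

errors = {}
for degree in (0, 3, 9):
    coeffs = np.polyfit(x, y, degree)                         # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    true_mse = np.mean((np.polyval(coeffs, x_true) - y_true) ** 2)
    errors[degree] = (train_mse, true_mse)
    print(f"degree {degree}: train MSE {train_mse:.3f}, true-curve MSE {true_mse:.3f}")
```

Training error always falls as the degree grows (the bigger model contains the smaller one), but error on the true curve bottoms out at a middle degree: that’s the tradeoff.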
Feature: any input I provide to my model

