Jedediyah Williams, PhD
jedediyah@gmail.com
Belmont High School
jedediyah.github.io/csta2024
cheese not sticking to pizza ↓
... add about 1/8 cup of glue

how many rocks should I eat ↓
... at least one small rock per day
A common misconception: if the problem isn't solved yet, it's just because you haven't added enough technology!
"Not only do many of the hiring tools not work, they are based on troubling pseudoscience and can discriminate"
"Our success, happiness, and wellbeing are never fully of our own making. Others' decisions can profoundly affect the course of our lives...
Arbitrary, inconsistent, or faulty decision-making thus raises serious concerns..."
- Fairness and Machine Learning, Barocas, Hardt, and Narayanan
"LLM chatbots have been designed in a way, known by psychologists and ethicists, to trick humans into believing they are intelligent."
| Data | Preprocess | Explore | Model | Communicate |
|---|---|---|---|---|
| 1. Get the data | 2. Clean up the data | 3. Explore the data | 4. Model it | 5. Share the results |
• Design
∘ Turn a problem into a data-problem.
∘ Survey or experimental design
∘ Database infrastructure
• Acquire
∘ Survey or experiment
∘ Download the dataset! CSV, API, etc.
∘ Web scraping
• Wrangle
∘ Format
∘ Clean and organize
∘ Check data integrity
• Prepare
∘ Label
∘ Split into training and testing sets
∘ Normalize
• Visualize
∘ Plot the data and get familiar with it
∘ Look for and compare features visually
∘ Consider appropriate models
• Inspect
∘ Exploratory data analysis
∘ Descriptive statistics
∘ Identify features analytically
• Model
∘ Try and compare multiple models
∘ Consider bias and variance
∘ Interpret model and performance
• Validate
∘ Assess model performance on independent test data
∘ Error analysis and stress-test
∘ Consider consequences
• Reflect
∘ Consider contexts, bias, and consequence
∘ Create audit plan
∘ Document - data and model
• Share
∘ Report documentation
∘ Inform policy
∘ Deploy in product
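The two most mechanical items under "Prepare" above (splitting into training and testing sets, and normalizing) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed workflow: the function names, the 25% test fraction, the z-score choice of normalization, and the numbers in `data` are all assumptions made up for the example.

```python
import random
import statistics

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_normalizer(values):
    """Return (mean, stdev) computed from the given values."""
    return statistics.mean(values), statistics.stdev(values)

def normalize(values, mean, stdev):
    """Rescale values to z-scores using previously fitted statistics."""
    return [(v - mean) / stdev for v in values]

# Hypothetical measurements, invented purely for illustration.
data = [12.0, 15.5, 9.8, 20.1, 17.3, 11.4, 14.9, 18.6]

train, test = train_test_split(data)
mean, stdev = fit_normalizer(train)       # statistics come from training data only
train_z = normalize(train, mean, stdev)
test_z = normalize(test, mean, stdev)     # reuse the training statistics
```

Fitting the normalization on the training set alone, then reusing it on the testing set, keeps information about the held-out data from influencing the preparation step.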
Environment
Data
|
• Harmful data collection, lack of consent, insecure / lack of privacy, historical, representational, or measurement bias, ...
|
Preprocess
|
• Labor exploitation, labeling by non-experts, incorrect labeling, trauma experienced by labelers, ...
|
Explore
|
• Feature selection bias, bias in interpretation of data visualization, data manipulation, feature hacking, ...
|
Model
|
• Bias in model choice, model-amplified bias, environmental impact, learning bias, evaluation bias, peripheral modeling, ...
|
Communicate
|
• Biased model interpretation, ignoring variance, rejecting model, deploying harmful products, deployment bias, ...
|
Meta
|
• "Pernicious feedback loops", runaway homogeneity, susceptibility to adversarial attack, lack of oversight or auditing, ...
|
Data
|
• Data problem: What will be the bounce height \(h_{bounce}\) of my bouncy ball when dropped from rest from a given drop height \(h_{drop}\)?
• Record several slow-motion videos. |
Preprocess
|
• Randomly choose a subset of videos as the training set; hold out the rest as the testing set.
• Parse the training set videos into a table. |
Explore
|
• Create a scatter plot of \(h_{bounce}(h_{drop})\)
• Look for features! Notice and wonder. Consider models. |
Model
|
• Find a best-fit model on the training data.
• Validate the model on the testing data. |
Communicate
|
• Reflect on the process.
• Share out. |
Training Data | Testing Data |
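The bouncy ball pipeline above (random training subset, best-fit model, validation on the testing data) might look roughly like this in Python. Everything specific here is an assumption for the sake of a runnable sketch: the table is synthetic rather than parsed from real videos, the model is taken to be proportional, \(h_{bounce} = k \cdot h_{drop}\), and the names `fit_proportional` and `rmse` are hypothetical helpers. Students may well find that a different model fits their own data better.

```python
import math
import random

def fit_proportional(points):
    """Least-squares slope k for the assumed model h_bounce = k * h_drop."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx

def rmse(points, k):
    """Root-mean-square error of the model's predictions on points."""
    return math.sqrt(sum((y - k * x) ** 2 for x, y in points) / len(points))

# Synthetic stand-in for the table parsed from the videos (NOT real data):
# drop heights in cm, bounce heights generated from an assumed ratio of 0.6
# plus a little random noise.
rng = random.Random(1)
drops = [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]
table = [(h, 0.6 * h + rng.uniform(-2, 2)) for h in drops]

# Randomly choose a training subset; the remainder is the testing set.
rng.shuffle(table)
train, test = table[:7], table[7:]

k = fit_proportional(train)               # fit on training data only
print(f"fitted ratio k = {k:.3f}")
print(f"training RMSE  = {rmse(train, k):.2f} cm")
print(f"testing RMSE   = {rmse(test, k):.2f} cm")
```

Comparing the training and testing errors is the validation step: a testing error much larger than the training error would suggest the model does not generalize beyond the videos it was fit on.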