Data and Modeling

jedediyah.github.io/nctm2022

Breaking Bias in Data and Modeling

Jedediyah Williams, PhD
Nantucket High School

September, 2022

@jedediyah
jedediyah.com/nctm2022

About Jed

Teaching

→

Astronomy

→

Robots

→

Teaching

DATA: There is too much to talk about!

Power
Surveillance
Privacy
Security
Consent
Access
Fairness

Education
Energy
Military
Misuse
Adversarial Attacks
Disinformation
Liberty

Discrimination
Labor
Environment
Exploitation
Law and Oversight
Accountability
Justice

Data Ethics
AI Ethics

Fair ML
Fair AI

Algorithmic Bias
Data Bias

axioms

Math is awesome.
Causing unnecessary harm is bad.

Data Modeling Process

Data

→

Preprocess

→

Explore

→

Model

→

Communicate

Modeling with data
Teaching and scaffolding modeling with data
Critically analyzing data technologies

"Yes, train these young people to get these skills, but integrate into that not only the technical capacity but the critical capacity to question what they're doing and what's happening. To me, it is not true empowerment unless people can have the power to question how these skills are going to be used."

- Ruha Benjamin
"The New Jim Code? Race, Carceral Technoscience, and Liberatory Imagination"

Data Modeling Process

Data

→

Preprocess

→

Explore

→

Model

→

Communicate

When we approach modeling as a series of design choices, we highlight the assumptions and subjectivity of value judgements made at each stage and begin to expose the inherent biases embedded within our models.

1 minute, talk to your neighbors:

How can math cause harm?

When we say that we are teachers of "mathematics", which "mathematics" are we talking about?

Authors

Reading

Watching

Cathy O'Neil

Weapons of Math Destruction (2016)

Virginia Eubanks

Automating Inequality (2018)

Automating Inequality
PBS 2018

Safiya Umoja Noble

Algorithms of Oppression (2018)

Meredith Broussard

(2019)

Janelle Shane

(2019)

The danger of AI is weirder than you think
TED 2019

Hannah Fry

( )

Should Computers Run the World?
Royal Institution 2019

Caroline Criado Perez

( )

Invisible Women
Engage 2019

Ruha Benjamin

( )

Ruha's resources for Race After Tech

Melanie Mitchell

( )

The Collapse of Artificial Intelligence
Santa Fe Institute 2019

Sasha Costanza-Chock

Design Justice (2020)

Kate Crawford

(2021)

Catherine D'Ignazio &
Lauren F. Klein

( )

Wait!
Isn't math objective and neutral?

Let's ignore philosophy, paradoxes, incompleteness, decidability, the unfinished state of mathematics; there is still complexity.

"However, two major discoveries of the twentieth century showed that Laplace's dream of complete prediction is not possible, even in principle...

It was the understanding of chaos that eventually laid to rest the hope of perfect prediction of all complex systems, quantum or otherwise." (Mitchell, 2019, p. 20)

"But even if it were the case that the natural laws had no longer any secret for us, we could still only know the initial situation approximately.
...
it may happen that small differences in the initial conditions produce very great ones in the final phenomenon.
...
Prediction becomes impossible."
(Poincaré, 1908, as cited in Mitchell, 2019, p. 21)
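
Poincaré's point about initial conditions can be seen in a few lines of code. The sketch below uses the logistic map, a standard textbook example of chaos that is not part of the talk itself; the starting value 0.2, the offset of 1e-10, and the parameter r = 4 are arbitrary choices for illustration.

```python
def logistic_map(x0, r=4.0, steps=60):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two starting points that differ by one part in ten billion.
a = logistic_map(0.2)
b = logistic_map(0.2 + 1e-10)

# Distance between the two trajectories at every step.
gaps = [abs(p - q) for p, q in zip(a, b)]
for step in (0, 10, 20, 30, 40, 50, 60):
    print(f"step {step:2d}: gap = {gaps[step]:.3e}")
```

The two runs start indistinguishably close, yet after a few dozen steps they bear no resemblance to each other. No measurement of the starting point is precise enough to prevent this.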

https://twitter.com/standupmaths/status/741251532167974912

"The lack of humility before nature that's being displayed here staggers me." - Malcolm, Jurassic Park

Data modeling applications

Search engine
Recommendation systems
Ranking systems
Application / resume filtering
Computer vision
Chat bots
Policing
Sentencing and parole
"Self-driving" vehicles
...

"Our success, happiness, and wellbeing are never fully of our own making. Others' decisions can profoundly affect the course of our lives...

Arbitrary, inconsistent, or faulty decision-making thus raises serious concerns..."

- Fairness and Machine Learning, Barocas, Hardt, and Narayanan

What are some consequences of data technologies?
↓

Some of the more well known harms

https://www.nytimes.com/2019/08/16/technology/ai-humans.html

https://www.attendeeinteractive.com/privacy-policy/

Anatomy of an AI system, Crawford and Joler

Adversarial attack

Algorithms are brittle - Melanie Mitchell

Lack of oversight or auditing

The act, by those in power, of making decisions for us is a display of the imbalance of power.
- Sun-ha Hong, Prediction as Extraction of Discretion

You are being surveilled.

You are being experimented on.

Big Picture

When handing over the tools of mathematics,
we are responsible as educators
for teaching their responsible use.

It is a sin of omission when we fail to acknowledge the consequences of the content we teach, consequences that include both ethical and technical pitfalls.

Subtle picture

There is no simple solution. There is no checklist that guarantees you won't cause harm.
Many ethical concerns are technical concerns.
Predicting, detecting, and mitigating harm and discrimination in data technologies are complex and active areas of research.

Fayyad et al (1996). The KDD Process for Extracting Useful Knowledge from Volumes of Data
(Knowledge Discovery in Databases)

Chapman et al (1999), Wirth (2000). "Towards a standard process model for data mining".

1. Obtain: pointing and clicking does not scale
2. Scrub: the world is a messy place
3. Explore: You can see a lot by looking
4. Models: always bad, sometimes ugly
5. Interpret: "The purpose of computing is insight, not numbers."

Mason and Wiggins (2010). "A Taxonomy of Data Science".

Schutt and O'Neil (2014). "Doing Data Science: Straight talk from the frontline".

GAIMME: Guidelines for Assessment and Instruction in Mathematical Modeling Education (2016).

Guidelines for Assessment and Instruction in Statistics Education (GAISE) Reports
(2020, based on 2007).

Estrellado et al (2020). Data Science in Education Using R, Section 3.2.

Zico Kolter (2021). Practical Data Science, Introduction.

Common Core / Achieve the Core.
I like the video here!

Many frameworks. Much overlap.

Data	1. Get the data
Preprocess	2. Clean up the data
Explore	3. Explore the data
Model	4. Model it
Communicate	5. Share the results

Data Modeling Process

Data

→

Preprocess

→

Explore

→

Model

→

Communicate

• Design
∘ Turn a problem into a data-problem.
∘ Survey or experimental design
∘ Database infrastructure
• Acquire
∘ Survey or experiment
∘ Download the dataset! CSV, API, etc.
∘ Web scraping

Data Modeling Process

Data

→

Preprocess

→

Explore

→

Model

→

Communicate

• Wrangle
∘ Format
∘ Clean and organize
∘ Check data integrity
• Prepare
∘ Label
∘ Split into training and testing sets
∘ Normalize

Data Splitting
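
The "Prepare" steps of splitting and normalizing can be sketched as follows. This is a minimal illustration, not the presenter's code: the measurements, the 25% test fraction, and the helper names `train_test_split`, `fit_scaler`, and `normalize` are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(values):
    """Return (mean, standard deviation) computed from the given values only."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5

def normalize(values, mean, std):
    """Rescale values to z-scores using the supplied statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical measurements, invented for illustration.
data = [12.0, 15.5, 9.8, 20.1, 17.3, 11.4, 14.9, 18.6]

train, test = train_test_split(data)
mean, std = fit_scaler(train)            # statistics come from training data only
train_z = normalize(train, mean, std)
test_z = normalize(test, mean, std)      # reuse the training statistics

print("train:", train)
print("test: ", test)
```

The design choice worth discussing with students is that the mean and standard deviation are computed from the training set alone. Computing them from all of the data would let information about the test set leak into the model.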

Data Modeling Process

Data

→

Preprocess

→

Explore

→

Model

→

Communicate

• Visualize
∘ Plot and familiarize with data
∘ Look for and compare features visually
∘ Consider appropriate models
• Inspect
∘ Exploratory data analysis
∘ Descriptive statistics
∘ Identify features analytically

Data Modeling Process

Data

→

Preprocess

→

Explore

→

Model

→

Communicate

• Model
∘ Try and compare multiple models
∘ Consider bias and variance
∘ Interpret model and performance
• Validate
∘ Assess model performance on independent test data
∘ Error analysis and stress-test
∘ Consider consequences
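
One way to make "try and compare multiple models" concrete is to fit models of very different flexibility and score each on both the training data and the held-out test data. The sketch below is an illustration under stated assumptions: the data are synthetic (a straight line plus random noise), and the three models (a constant, a least-squares line, and a nearest-neighbor lookup) were chosen only to contrast a rigid model with one that memorizes its training data.

```python
import random

rng = random.Random(1)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
points = [(x, 2 * x + 1 + rng.gauss(0, 1.0)) for x in [i * 0.25 for i in range(40)]]
rng.shuffle(points)
train, test = points[:30], points[30:]

def fit_mean(data):
    """Rigid model: always predict the average training output."""
    m = sum(y for _, y in data) / len(data)
    return lambda x: m

def fit_line(data):
    """Least-squares straight line."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    return lambda x: my + slope * (x - mx)

def fit_nearest(data):
    """Flexible model: copy the output of the closest training point."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared prediction error of model on data."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("nearest", fit_nearest)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:8s} train MSE = {results[name][0]:6.2f}   test MSE = {results[name][1]:6.2f}")
```

The nearest-neighbor model has zero error on the training data because it has memorized it, which is exactly why training error alone says little about a model. Only the error on independent test data indicates how each model handles points it has not seen.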

Data Modeling Process

Data

→

Preprocess

→

Explore

→

Model

→

Communicate

• Reflect
∘ Consider contexts, bias, and consequence
∘ Create audit plan
∘ Document - data and model
• Share
∘ Report documentation
∘ Inform policy
∘ Deploy in product

Data Modeling Process

Environment

→

Data

→

Preprocess

→

Explore

→

Model

→

Communicate

→

A framework for critical analysis

Data	• Harmful data collection, lack of consent, insecure / lack of privacy, historical, representational, or measurement bias, ...
Preprocess	• Labor exploitation, labeling by non-experts, incorrect labeling, trauma experienced by labelers, ...
Explore	• Feature selection bias, bias in interpretation of data visualization, data manipulation, feature hacking, ...
Model	• Bias in model choice, model-amplified bias, environmental impact, learning bias, evaluation bias, peripheral modeling, ...
Communicate	• Biased model interpretation, ignoring variance, rejecting model, deploying harmful products, deployment bias, ...
Meta	• "Pernicious feedback loops", runaway homogeneity, susceptibility to adversarial attack, lack of oversight or auditing, ...

"A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle", Harini Suresh and John V. Guttag

https://lighthouse3.com/newsletter/

Critical Questions:

What are the motivations for the project?
What is the intended use?
What is the unintended use or misuse?
Where does the data come from?
Who collects the data?
Who owns the data?
How is the data collected?
How is the data stored?
How old is the data?
When will the data expire?
How will the data be secured?
What happens with the data when the company is sold?
Who does the labeling?
What labels will they decide to use?
Are the labelers experts?
Are the labels accurate?
What biases are represented in the data?
How is data included or excluded?

How are outliers addressed?
What subpopulations are represented?
What subpopulations are over- or underrepresented?
What portions of the data are inspected?
What features are selected for modeling?
What model is chosen?
What features do we think are being modeled?
What latent features are actually being modeled?
What is the domain of the model?
What are the consequences of error?
What decisions will be made with the model?
What biases are perpetuated?
Where will the model be deployed?
What could go wrong?
Who is responsible when things go wrong?
How can issues be reported?
Will new data be fed back in to update the model?
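
Some of these questions can be checked directly in code. As a minimal sketch, the example below takes questions such as "What subpopulations are over- or underrepresented?" and "What are the consequences of error?" and computes each group's share of the data alongside its error rate. The records and the group labels "A" and "B" are hypothetical, invented only to show the calculation.

```python
from collections import Counter, defaultdict

# Hypothetical evaluation records, invented for illustration:
# (subgroup label, true outcome, model prediction)
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("A", 1, 1), ("A", 0, 1), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 0),
]

# How many records does each subgroup contribute?
counts = Counter(group for group, _, _ in records)

# How many of each subgroup's predictions are wrong?
errors = defaultdict(int)
for group, truth, predicted in records:
    errors[group] += int(truth != predicted)

overall_error = sum(errors.values()) / len(records)
report = {
    group: {
        "share": counts[group] / len(records),
        "error_rate": errors[group] / counts[group],
    }
    for group in counts
}

print(f"overall error rate: {overall_error:.2f}")
for group, stats in sorted(report.items()):
    print(f"group {group}: share = {stats['share']:.2f}, error rate = {stats['error_rate']:.2f}")
```

In this made-up data the overall error rate looks modest, but it hides a much higher error rate for the smaller group. A single aggregate number can conceal exactly the harm these questions are meant to surface.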

Have you ever read a book in a math class?

Data Modeling Process

Data

→

Preprocess

→

Explore

→

Model

→

Communicate

Modeling with data
Teaching and scaffolding modeling with data
Critically analyzing data technologies

How high does a bouncy ball bounce?

Data

Preprocess

Explore

Model

Communicate

How high does a bouncy ball bounce?

Data	• Data problem: What will be the bounce height \(h_{bounce}\) of my bouncy ball when dropped from rest from a given drop height \(h_{drop}\)? • Record several slow-motion videos.
Preprocess	• Randomly choose a subset of videos as the training set. • Parse the training set videos into a table.
Explore	• Create a scatter plot of \(h_{bounce}(h_{drop})\) • Look for features! Notice and wonder. Consider models.
Model	• Find a best-fit model on the training data. • Validate the model on the testing data.
Communicate	• Reflect on the process. • Share out.
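
The five steps above can be sketched end to end in code. This is an illustration under stated assumptions, not the classroom data: the drop and bounce heights are simulated with an invented bounce ratio of 0.7 plus measurement noise, and the model is a proportional fit \(h_{bounce} = k \cdot h_{drop}\), which is one reasonable choice among several.

```python
import random

rng = random.Random(42)

# Data: hypothetical measurements in meters, invented for illustration.
# Each pair is (h_drop, h_bounce) as it might be read from one video.
TRUE_RATIO = 0.7   # assumed bounce ratio, used only to generate fake data
videos = [(h, TRUE_RATIO * h + rng.gauss(0, 0.02))
          for h in [0.2 + 0.1 * i for i in range(16)]]

# Preprocess: randomly choose a subset of videos as the training set.
rng.shuffle(videos)
train, test = videos[:12], videos[12:]

# Model: least-squares fit of h_bounce = k * h_drop on the training data.
k = sum(d * b for d, b in train) / sum(d * d for d, _ in train)

def mean_abs_error(data):
    """Average absolute gap between predicted and measured bounce height."""
    return sum(abs(k * d - b) for d, b in data) / len(data)

# Validate: compare error on data the model has and has not seen.
train_error = mean_abs_error(train)
test_error = mean_abs_error(test)

print(f"fitted k = {k:.3f}")
print(f"training error = {train_error:.3f} m")
print(f"testing error  = {test_error:.3f} m")
```

With real videos, the list `videos` would be replaced by the table parsed in the Preprocess step; the rest of the sketch stays the same.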

How high does a bouncy ball bounce?

Bounce Prediction Error (plots: Training Data, Testing Data)

https://reproducible.cs.princeton.edu/#rep-failures

Break models

"How high does a bouncy ball bounce?"

becomes:

"How much can we minimize the error of a linear model when predicting how high this particular bouncy ball will bounce in this room on this surface at this temperature and humidity when dropped from rest at a height of no more than two meters?"
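
The restriction to drops of "no more than two meters" matters, and a model can be broken on purpose to show why. In the sketch below, the "true" bounce behavior is an invented function that bends away from a straight line at larger heights; it is an assumption for illustration, not a physical law or measured data. A line is fit only to drops between 0.2 m and 2.0 m and then asked about a 10 m drop.

```python
def true_bounce(h_drop):
    # Invented ground truth for illustration: bounces fall further and
    # further below a straight line as the drop height grows.
    return 0.8 * h_drop / (1.0 + 0.15 * h_drop)

# Training data covers only drops from 0.2 m to 2.0 m.
train_heights = [0.2 * i for i in range(1, 11)]
train = [(h, true_bounce(h)) for h in train_heights]

# Least-squares straight line fit to the training data.
n = len(train)
mean_h = sum(h for h, _ in train) / n
mean_b = sum(b for _, b in train) / n
slope = (sum((h - mean_h) * (b - mean_b) for h, b in train)
         / sum((h - mean_h) ** 2 for h, _ in train))
intercept = mean_b - slope * mean_h

def predict(h_drop):
    return intercept + slope * h_drop

# Worst error inside the domain the model was trained on.
inside_error = max(abs(predict(h) - b) for h, b in train)

# Error when the model is used far outside that domain.
outside_error = abs(predict(10.0) - true_bounce(10.0))

print(f"largest error for drops up to 2 m: {inside_error:.3f} m")
print(f"error for a 10 m drop:             {outside_error:.3f} m")
```

The line looks excellent on the range it was fit to and fails badly outside it. Nothing in the fit itself warns of this, which is why stating the domain of a model is part of the model.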

Data Modeling Projects

How high does a bouncy ball bounce?
How far will the ball roll?
What is the period of a pendulum?
When will the water reach 40℃?
When is high tide?
How much daylight will there be on Jan 1?
When will the sun set on Feb 1?
What is the best move in Hexapawn?
What is the best move in Tic Tac Toe?
Which NFL team will win Monday?

I have a project for you!
But first, in Summary

Model data. It's awesome.
Break models. Witness them failing.
Critically analyze technology.

Classify these fruit!

Using data pipelines as critical frameworks:

A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle - Harini Suresh, John V. Guttag

Teaching with ethics at the forefront:

Teaching Machine Learning in the Context of Critical Quantitative Information Literacy - Carrie Diaz Eaton.
Integrating the Humanities into Data Science Education - Eric A. Vance et al.

AI Now Institute reports: https://ainowinstitute.org/reports.html
Automating Ambiguity: Challenges and Pitfalls of Artificial Intelligence - Abeba Birhane
On the dangers of stochastic parrots: Can language models be too big? - Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell
Rachael Tatman - YouTube
Rachel Thomas Fast.ai Data Ethics Course
Joy Buolamwini https://www.media.mit.edu/people/joyab/publications/
SERJ special issue: https://iase-web.org/ojs/SERJ/issue/view/28
AIES '22: Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society https://dl.acm.org/doi/proceedings/10.1145/3514094

Education:
Teaching Machine Learning in the Context of Critical Quantitative Information Literacy
Integrating data science ethics into an undergraduate major: A case study
A call for a humanistic stance toward k-12 data science education
Artificial intelligence in education: Addressing ethical challenges in K-12 settings
Provisional Data Science for Social Change Spring 2022 schedule
