Data Science and Consequences in Mathematical Modeling


Jedediyah Williams

This page contains resources for a session at the 2021 NCTM Fall Virtual Conference.

slides: link to slides        video: link to video of talk

Below is an overview of the material from the session along with additional resources. At the core of what I want to communicate are two ideas:

  1. Math is awesome
  2. Doing harm is bad
These are not original ideas.

A Data Science Process

Overview

The figure below depicts a Data Science Process. This process developed as multiple disciplines converged on modeling with data (Chapman 1999, Han 2012 Figure P.1, Schutt and O'Neil 2014 p. 41, Estrellado 2020 Chp. 3, Nantasenamat 2020, Kolter 2021).


Roughly, this process involves getting data, cleaning it up, inspecting it, modeling it, and sharing the results. In practice the process is not so linear; rather, like the engineering design process, it iteratively converges toward a solution.

This process is a structure for modeling with data and a tool for incorporating concepts of data science into existing courses. Below is a brief description of the five stages.

Data: Design and Acquire
The first stage involves getting data. This might require the design of an experiment or survey, or it could involve simply downloading a dataset. In any case, we require a data problem and data with which to solve it.

Preprocess: Wrangle and Prepare
Preprocessing data might be the stage with which students (and teachers) are least familiar, but by many accounts it is where data scientists spend the majority of their time (Kelleher and Tierney 2018, 65-67). Preprocessing involves cleaning up and wrangling data into usable formats. If you have ever combed through a spreadsheet to find cells that interpreted an entry as a date when it wasn't, you have a sense of the preprocessing experience and how messy it can be.
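As a concrete illustration, here is a minimal wrangling sketch in Python with pandas. The file name measurements.csv and the column height_cm are hypothetical placeholders, not part of any particular project.

```python
import pandas as pd

# Read everything as text first so the parser cannot silently reinterpret entries.
df = pd.read_csv("measurements.csv", dtype=str)

# Coerce a measurement column back to numbers; anything that is not numeric
# (e.g., an entry a spreadsheet turned into a date) becomes NaN.
df["height_cm"] = pd.to_numeric(df["height_cm"], errors="coerce")

# Inspect the rows that failed to parse, then drop them.
print(df[df["height_cm"].isna()])
df = df.dropna(subset=["height_cm"])
```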

Explore: Visualize and Inspect
The third stage of Explore almost always involves visualizing data, particularly for the data projects our students work on. In fact, it is not unusual to see this stage called "Visualize". Exploring can also involve analyzing descriptive statistics to become familiar with the data and get a sense of its shape.
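For instance, a quick exploration in Python might look like the sketch below (hypothetical file and column names); students could just as easily do this in Google Sheets or Desmos.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file with two numeric columns named "x" and "y".
df = pd.read_csv("measurements.csv")

# Descriptive statistics give a first sense of center, spread, and range.
print(df.describe())

# A quick scatter plot to notice and wonder about a possible relationship.
plt.scatter(df["x"], df["y"])
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```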

Model: Model and Validate
This is the most important stage in some respects. Chances are good that your curriculum documents emphasize this stage and largely ignore the others! A critical component of modeling which is often left out is validation. Validation involves actually checking that your model works.

Communicate: Communicate and Deploy
The stage of Communicate could also include "Deploy", i.e., incorporating a model into a product or implementing a policy decision. While some students hopefully will deploy products they develop, we emphasize communication as the critical component after developing a mathematical model.


The Data Science Process is useful as a framework for structuring student modeling projects, for bringing concepts of data science into existing courses, and for critically analyzing data technologies.

The expanded figure below lists some of the components of each stage in the process. This partial list includes steps that students might utilize when working with introductory models.


An expanded view of the Data Science Process. In addition to iterating on these stages, some stages may blend together. For example, exploration can easily blend into the modeling stage.

When first introducing the framework, we might scaffold a data modeling project by providing the data, providing explicit walk-throughs of the middle stages, or offering templates for the last stage. As students familiarize themselves with the process and we remove scaffolding, we are working toward a point when the students can execute an entire data modeling project from a single prompt.

While the structure is useful and captures an abstraction of what modern data science often looks like, it is just one conceptualization. Projects often don't flow in a linear progression from the first stage to the last. For example, after the stage of Explore, we might revisit the stage of Preprocess to do some work we missed. After the stage of Model, we might realize we need to find different features, so we revisit Explore. There can also be a lot of overlap between stages.

Example Projects

The following examples are intended to demonstrate the framing of data projects with the Data Science Process. The complete projects are not necessarily presented, but I try to offer enough information that you could do these projects or facilitate them with students.


Example: Bouncy Ball

Data
  • Design
    Pose the question to students:
    Question
    How high does a bouncy ball bounce?

    Discuss and translate this into a data question:

    Data Question
    Given a particular bouncy ball dropped from rest at a height \(h_{i}\) on a consistent surface, what is the predicted peak height \(h_{p}\) of the first bounce?
  • Acquire
    My students use the slo-mo features on their cell phones to record several trials.
Preprocess
  • Wrangle
    The videos tend to be spread across multiple students, and sometimes those students are absent the next class! Students parse the video data while applying some quality control to how measurements are read from the video. Some students develop a procedure in which they independently read the measurements from the video and then average their readings.
  • Prepare
    They create tables in a spreadsheet and label the columns.
Explore
  • Visualize
    We create scatter-plots of the bounce-height \(h_{p}\) as a function of drop-height \(h_{i}\). Some students use Google Sheets and some use Desmos.
  • Inspect
    We notice and wonder about the relationship between the variables we are exploring. Students notice that \(h_p\) appears to have a linear relationship to \(h_i\). They wonder if that relationship continues forever as \(h_i\) increases. Some students notice that there are data points which are far away from the others and they wonder if they should revisit their video data. Some students wonder about the bounce-height for a drop-height of zero!
Model
  • Model
    Students model their data with an appropriate choice of function. If they can do a regression, they find the best-fit.
  • Validate
    We use our models to make several predictions of unrecorded drop heights, then validate our model by assessing those predictions against experiment. Our validation tests include interpolated predictions, extrapolated predictions, and extreme extrapolated predictions (a code sketch of this step follows the example).
Communicate
  • Report
    Students present their work. This could be a presentation, podcast, or report. An important piece of this project is to reflect on the performance of the model on out-of-domain drop-heights.
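For readers who want to see the Model and Validate steps in code, below is a minimal sketch using NumPy. The measurements are hypothetical stand-ins for student data; students could do the same regression in Google Sheets or Desmos.

```python
import numpy as np

# Hypothetical measurements: drop height h_i and peak bounce height h_p, in cm.
h_i = np.array([20, 40, 60, 80, 100, 120])
h_p = np.array([15, 29, 45, 58, 74, 88])

# Model: fit h_p = a * h_i + b by least-squares regression.
a, b = np.polyfit(h_i, h_p, deg=1)
print(f"model: h_p = {a:.2f} * h_i + {b:.2f}")

# Validate: compare predictions against held-out trials, including an
# interpolated, an extrapolated, and an extreme extrapolated drop height.
validation_trials = {50: 37, 110: 82, 250: 168}  # drop height -> measured bounce (hypothetical)
for drop, measured in validation_trials.items():
    predicted = a * drop + b
    print(f"drop {drop} cm: predicted {predicted:.1f} cm, measured {measured} cm")
```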

Example: Blue Fishing

Data
  • Design
    Pose the question to students:
    Question
    What time should Zach go casting for blues this Saturday?

    Discuss and help them translate this into a data question:

    Data Question
    Using the tide data from the NOAA Brant Point station NTKM3, when do we predict high tide will be on Saturday?
  • Acquire
    We download a .csv file of tide data from NOAA, or I will pre-download the data over a specific time interval and send it to them.
Preprocess
  • Wrangle
    We import the data into a spreadsheet, remove any corrupted rows (with missing data), and label the columns.
  • Prepare
    We split the data into training and testing sets. For my students, this is the first lab we do where we utilize this type of validation. There are some data-dependent conventions for this but we often use 70% of our data for training our models and 30% for validation of those models.
Explore
  (Important to note: the testing data is hidden at this stage and we are working only with the training data.)
  • Visualize
    We plot the water level as a function of time for the training data.
  • Inspect
    We notice and wonder about the relationship between the variables we are exploring. "It looks like the sine wave things we studied last week!"
Model
  • Model
    Students model their data with an appropriate choice of function. If they can do a regression, they find the best-fit.
  • Validate
    Only after students are satisfied that they have a good model do we move to the step of validation. To validate their model, they test how well it predicts, or "fits," the testing data (a code sketch of this step follows the example).
    It's a good fit! We can predict with some level of confidence when Zach should go casting for blues on Saturday.
Communicate
  • Report
    Students present their work. This could be a presentation, podcast, or report. We will want to be conscientious of variability and the reliability of our model for predicting future tides. There are also well-known factors that were not present in the training data, e.g., weather. An instructive project would be to collect more NOAA data and test model predictive power over longer time scales.
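For a sense of what the split, fit, and validation can look like in code, here is a minimal sketch using SciPy. The file tides.csv and its columns hours and water_level are hypothetical placeholders for the NOAA export.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

# Hypothetical NOAA export with columns "hours" (since start) and "water_level".
df = pd.read_csv("tides.csv")

# Chronological 70/30 split: train on earlier data, hold out the rest for validation.
cut = int(0.7 * len(df))
train, test = df.iloc[:cut], df.iloc[cut:]

# Sinusoidal model: level(t) = A * sin(B*t + C) + D.
def tide(t, A, B, C, D):
    return A * np.sin(B * t + C) + D

# Initial guess: a semidiurnal tide has a period of roughly 12.4 hours.
p0 = [1.0, 2 * np.pi / 12.4, 0.0, train["water_level"].mean()]
params, _ = curve_fit(tide, train["hours"], train["water_level"], p0=p0)

# Validate only now, against the held-out testing data.
pred = tide(test["hours"], *params)
rmse = np.sqrt(np.mean((pred - test["water_level"]) ** 2))
print(f"test RMSE: {rmse:.3f}")
```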

Example: Fruit Classifier

Fruit Classification Colaboratory Project

Data
  • Design
    Pose the question to students:
    Question
    What fruit is in the picture?

    Discuss and help them translate this into a data question:

    Data Question
    What are the distinguishing image features that can be used to accurately classify fruit?
  • Acquire
    The images are downloadable through the Colab.
Preprocess
  • Wrangle
    We parse the image files into Python lists of Image data.
  • Prepare
    The data is prelabeled for us. We split the data randomly into training and testing sets.
Explore
  • Visualize
    We display sample images from the dataset.
  • Inspect
    We notice and wonder about the relationship between the variables we are exploring. There are several features we could identify. The simplest is color! We could look at the amount of red, green, and blue in our attempt to distinguish the fruit.
Model
  • Model
    Students model their data with an appropriate choice of function. This particular model involves some conditional statements about color values (a code sketch follows this example).
  • Validate
    We check the classifier against the held-out testing set to see how often it labels fruit correctly.
Communicate
  • Report
    Students present their work. This particular project focuses on computational thinking and some technical skills.
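Below is a minimal sketch of the kind of conditional, color-based classifier described above. The fruit names and the decision rule are hypothetical; the actual Colab project defines its own classes and features.

```python
import numpy as np
from PIL import Image

def classify_fruit(path):
    """Toy classifier: compare the mean red, green, and blue values of an image."""
    pixels = np.asarray(Image.open(path).convert("RGB"), dtype=float).reshape(-1, 3)
    mean_r, mean_g, mean_b = pixels.mean(axis=0)
    # Hypothetical rule: redder images are apples, greener are limes, otherwise bananas.
    if mean_r > mean_g and mean_r > mean_b:
        return "apple"
    if mean_g > mean_r and mean_g > mean_b:
        return "lime"
    return "banana"

def accuracy(paths, labels):
    """Validate on the held-out testing set: fraction of images labeled correctly."""
    predictions = [classify_fruit(p) for p in paths]
    return np.mean([pred == label for pred, label in zip(predictions, labels)])
```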

Concern

When students are familiar with the Data Science Process, we can leverage its structure to critically analyze data technologies, including those that we are building in class.

When our models are useful, which is to say they make good predictions, it is important to identify how and why they work; this is a critical step in anticipating when they will fail. Given the simplicity of most of the models our students will see, this often involves questions of inference and variability. However, students should understand that as their models grow more complex, so do the complications of applying those models, even to the extent that models are not applicable (Birhane 2021).

It can be incredibly difficult to identify harm done by technologies, let alone predict or anticipate the source of it in increasingly complex data systems that are interacting with complex social systems, often with multiple feedback loops. In critically analyzing data science products, context is everything and there is no limit of nuance. When we deploy models or products to make or aid in decisions that affect lives, we have a moral obligation to validate and monitor those products to avoid doing harm.

Harm from technologies may include, but is not limited to, disproportionate harm done to subpopulations. We have seen that historically marginalized populations are often further marginalized by data technologies (Benjamin 2019, Criado-Perez 2020, Eubanks 2019, Wachter-Boettcher 2017). Bias and harm may be incorporated at each stage of the Data Science Process. Assuming we wish to avoid causing harm, we should be conscientious of these issues and pass this understanding on to students who are active consumers and future practitioners of data technologies.

The list below contains some of the considerations we may face at each stage of the data science process.

Data
  • Problematic Problems
    At the very first step of the process, we translate a problem into a data problem. It is entirely possible that the problem is not one that should be addressed. This could be a value judgment or moral dilemma.
    Another issue is presuming a metric in the framing of the problem, e.g., we want students to do better on a timed test, so we ask: how do we get students to solve problems faster?
  • Harmful Data Collection
    This can be passive, e.g., presuming cultural conventions of name formats in a survey. It can also be aggressive, manipulating vulnerable populations or violently claiming access.
  • Biased Data
    Perhaps the most notorious concern, biased data, e.g., sampling bias, may lead to poorer outcomes for some. Consider medical studies conducted primarily on men.
  • Privacy
    There is far too much to say about this bullet. Remember when a fitness tracking company released customer data that revealed the locations of military personnel?
    When collecting and working with data, security considerations can be critical.
  • Consent
    Each of us has had data collected about us without our consent. At what point does this cross a line?
  • Environmental Impact
    Data consumes resources. There is A LOT of data.
Preprocess
  • Labor Exploitation
    Outsourcing and underpaying.
  • Bad Labeling
    Having non-experts label, e.g., medical data.
  • Trauma Experienced by Labelers
    Labelers who spend hours each day parsing horrific content.
Explore
  • Feature Bias
    Engineering bias into the model through feature choices.
  • Bad Data Visualization
    Bad data visualizations can mislead. This can be exploited maliciously to influence decisions, or it could inadvertently mislead, possibly amplifying or even creating new biases.
  • Data Manipulation
    Not seeing the pattern you were hoping for, then making it.
Model
  • What problem are we solving?
    Algorithms do what we tell them to do, not what we want them to do (Shane 2019, Chapter 5). Connecting back to the question we set out to answer at the first stage of the Data Science Process: having the best data about the wrong thing means we aren't modeling what we expected or assumed. We may discover such a disconnect at this stage.
  • Bias in Model Choice
    The choice of model itself embeds assumptions and can introduce bias.
  • Model-Amplified Bias
    While data may be biased, so are models. This could be due to simple technical issues like getting "stuck" in a local minimum or being anchored by initial training conditions.
  • Environmental Impact
    Many modern models require enormous computational resources.
  • Lack of Robustness
    How sensitive is the model output to variability in input? Where are the corner-cases and how does the model behave there?
  • Peripheral Modeling
    Did the model really learn what we think it did?
Communicate
  • Poor Model Interpretation
    Models only model what they model; it is the humans who built them who interpret the results.
  • Ignoring Variance
    What good are predictions when the errors are so large?
  • Ignoring Conclusions
    Even when predictions are good, it is up to people to utilize them or not. Consider the case of a résumé parser that actually works well but whose results are ignored.
  • Deploying Harmful Products
    There are unfortunately many instances of products being deployed that simply do harm. This can be related to automation bias.
Meta
  • Feedback Loops
    The concept of a feedback loop is used in several fields, e.g., electrical engineering or control systems, and isn't necessarily harmful. Feedback loops lead to harm in data technologies when they amplify bias or exhibit runaway, self-reinforcing performance optimization. Consider the example of predictive policing: an algorithm tells police where to look for crime, and of course police find crime where they look for it. That policing not only has the appearance of affirming the model's prediction but reinforces it, so they police there even more. Even without considering the bias of the data used in the original prediction, it becomes clear that such a system will perpetuate itself by seeming to justify its use until we step back and realize the effect of the feedback loop. Without a critical analysis of such tools, the loop simply continues unchecked (a toy simulation follows this list).
  • Susceptibility to Adversarial Attack
    Algorithms are brittle.
  • Lack of Oversight or Auditing
    "All models are wrong", so all deployed models need some mechanism for reporting issues.

Consequence

Many people are working on issues of technology and society, and particularly data technologies. Some years ago there was a flood of media attention to the sudden demand for philosophers in tech. There were any number of articles about the modern-day reformulation of the trolley problem in the context of self-driving cars. Engineers creating self-driving cars are in a situation where they need to decide how a car should respond in an accident before that accident happens. Who is at fault when a self-driving car causes harm? This is, unfortunately, not hypothetical.

Here are some interesting articles documenting consequences. I kind of want to start collecting them in an organized way. Maybe let me know if you'd like that (and like to help)?

Conclusions

"Our success, happiness, and wellbeing are never fully of our own making. Others' decisions can profoundly affect the course of our lives."
– Barocas et al. 2019, p. 1

When we create models trained on data, we are creating optimization problems to maximize or minimize some set of metrics that we have identified as valuable or have let an algorithm identify as valuable. We are encoding values. The extent to which these algorithms might cause harm can be incredibly difficult to predict or even detect, so it is important that we are able to critically analyze methods and applications so as to minimize harm.

When handing over the tools of mathematics, we are responsible as educators for teaching their responsible use. It is a sin of omission when we fail to acknowledge the consequences of the content we are teaching. If we teach mathematics because it is practical, it is because mathematics is applicable to solving problems, and applied mathematics has consequences. From the problems they choose to solve, to the solutions they choose to find, to the choices embedded in the process of modeling, practitioners of mathematics make decisions that affect us all, and we are their educators.

Resources

  • Fast AI's Practical Data Ethics course.
  • Ethics in Mathematics Readings, curated by Allison N. Miller.
  • The fantastic videos below are all related to books which I also highly recommend. The videos are a good introduction to each author and their work.
  • Automating Inequality - Virginia Eubanks
  • The New Jim Code? - Ruha Benjamin
  • The collapse of artificial intelligence - Melanie Mitchell
  • Invisible Women - Caroline Criado Perez
  • The danger of AI is weirder than you think - Janelle Shane
  • The future of human-robot interaction - Kate Darling
References

    Barocas, Solon, Moritz Hardt, and Arvind Narayanan. 2019. Fairness and Machine Learning. https://www.fairmlbook.org
              
    Benjamin, Ruha. 2019. Race after Technology: Abolitionist Tools for the New Jim Code. Medford, MA: Polity.
            
    Birhane, Abeba. 2021. “The Impossibility of Automating Ambiguity.” Artificial Life 27, no. 1 (June): 44-61.  10.1162/artl_a_00336.
    
    Brokaw, Galen. 2010. A History of the Khipu. Cambridge, United Kingdom: Cambridge University Press.
    
    Chapman, Pete, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer, and Rüdiger Wirth. 1999. “CRISP-DM 1.0: Step-by-step Data Mining Guide.” https://www.the-modeling-agency.com/crisp-dm.pdf.
    
    Criado-Perez, Caroline. 2020. Invisible Women: Exposing Data Bias in a World Designed for Men. London, England: Vintage.
    
    Estrellado, Ryan A., Emily A. Freer, Jesse Mostipak, Joshua M. Rosenberg, and Isabella C. Velázquez. 2020. Data Science in Education Using R. New York, NY: Routledge.
    
    Eubanks, Virginia. 2019. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. New York, NY: Picador.
    
    Friendly, Michael. 2006. “A Brief History of Data Visualization.” In Handbook of Computational Statistics: Data Visualization, edited by C. Chen, W. Hardle, and A. Unwin. Heidelberg: Springer-Verlag.
    
    Fry, Hannah. 2018. Hello World: How to Be Human in the Age of the Machine. London, England: Transworld
    Publishers.
    
    Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. “Datasheets for Datasets.” (March). https://arxiv.org/abs/1803.09010.
    
    Grus, Joel. 2019. Data Science from Scratch: First Principles with Python. 2nd ed. Sebastopol, CA: O'Reilly Media, Inc.
    
    Han, Jiawei, Micheline Kamber, and Jian Pei. 2012. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems.
    
    Hill, Kashmir, and Aaron Krolik. 2020. “Your Kids' Photos May be Powering Surveillance.” New York Times:
    Artificial Intelligence Edition (New York), 2020, 40-45.
    
    Hooker, Sara. 2021. “Moving beyond 'algorithmic bias is a data problem.'” Patterns 2 (4).
    https://doi.org/10.1016/j.patter.2021.100241.
    
    James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical
    Learning: with Applications in R. 1st ed. New York, NY: Springer.
    
    Kelleher, John D., and Brendan Tierney. 2018. Data Science. Cambridge, MA: The MIT Press.
    
    Kolter, Zico. 2021. “Introduction.” Practical Data Science. http://www.datasciencecourse.org/notes/intro/.
    
    Mason, Hilary, and Chris Wiggins. 2010. “A Taxonomy of Data Science.” dataists.
    http://www.dataists.com/2010/09/a-taxonomy-of-data-science/.
    
    Metz, Cade. 2020. “Machine Learning Takes Many Human Teachers.” New York Times: Artificial Intelligence Edition (New York), 2020, 18-24.
    
    Mitchell, Melanie. 2019. Artificial Intelligence: A Guide for Thinking Humans. New York, NY: Farrar, Straus and Giroux.
    
    Nantasenamat, Chanin. 2020. “The Data Science Process.” Towards Data Science.
    https://towardsdatascience.com/the-data-science-process-a19eb7ebc41b.
    
    O'Neil, Cathy. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York, NY: Crown Publishing Group.
    
    Schutt, Rachel, and Cathy O'Neil. 2014. Doing Data Science. Sebastopol, CA: O'Reilly Media, Inc.
    
    Schweinsberg, Martin, et al. 2021. “Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis.” Organizational Behavior and Human Decision Processes 165: 228-249. https://doi.org/10.1016/j.obhdp.2021.02.003.
    
    Seth Jones, Ryan, Zhigang Jia, and Joel Bezaire. 2020. “Giving Birth to Inferential Reasoning.” Mathematics Teacher: Learning and Teaching PK-12 113, no. 4 (April): 287-92.
    
    Shane, Janelle. 2019. You Look Like a Thing and I Love You. New York, NY: Voracious / Little, Brown and Company.
    
    Wachter-Boettcher, Sara. 2017. Technically Wrong: Sexist Apps, Biased Algorithms, and Other Threats of Toxic Tech. New York, NY: W. W. Norton & Company.