{"id":567519,"date":"2019-02-13T10:00:01","date_gmt":"2019-02-13T18:00:01","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=567519"},"modified":"2019-02-12T11:47:00","modified_gmt":"2019-02-12T19:47:00","slug":"what-are-the-biases-in-my-data","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/what-are-the-biases-in-my-data\/","title":{"rendered":"What are the biases in my data?"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-567552\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-1024x576.png\" alt=\"\" width=\"1024\" height=\"576\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-343x193.png 343w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>One challenge with AI algorithmic fairness is that one usually has to know the potential group(s) that an algorithm might discriminate against in the 
first place. However, in joint work with Maria De-Arteaga, Nathaniel Swinger, Tom Heffernan, and Max Leiserson, we automatically enumerate groups of people who may be discriminated against, along with the potential biases against them. We do this using word embeddings, a popular AI tool for processing language. This proves useful for detecting age, gender, religious, and racial biases. The algorithm is designed to capture intersectional biases and to account for the fact that many forms of discrimination, such as racial discrimination, are linked to social constructs that may vary depending on the context, rather than to categories with a fixed definition. In this blog post, we explain the inputs and outputs of our system.<\/p>\n<p><em>WARNING: This blog post includes offensive associations found by our algorithm. We feel that it is important for AI researchers and practitioners to be aware of these associations before using word embeddings in their systems.<\/em><\/p>\n<p>AI researchers and practitioners are concerned that their algorithms could perpetuate biases in the data they are trained on, and they don\u2019t want their systems to discriminate against groups of people that are already disadvantaged. One question I often hear people struggling with is, \u201cHow do I know what groups my system might discriminate against in the first place?\u201d Even with a definition of fairness to a sensitive group (right now there are <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/fatconference.org\/static\/tutorials\/narayanan-21defs18.pdf\">too many definitions of fairness<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>), one must first figure out who the groups are. Obvious groupings are by race, ethnicity, age, and gender, to name a few. Many of these groups, such as those based on ethnicity, are really hard to define, and the definitions might vary based on who the system\u2019s users are. 
In a recent project with several collaborators ranging from high school and graduate students to faculty members, we attempt to answer this question for word embeddings.<\/p>\n<p>Backing up for a moment, as you may know, word embeddings are often trained on billions of words from the Web. They are an extremely popular technology used in processing language that functions for computers like a dictionary does for people. One of the things people find coolest about word embeddings is that they can answer analogy questions like <em>man:woman :: king:<\/em>? with <em>queen<\/em>. One afternoon in early 2016, sitting in our Microsoft Research office in Cambridge, MA, a group of colleagues and I thought of asking another question: <em>man:woman :: doctor:<\/em>? and, to our horror, the computer responded with <em>nurse<\/em>. We then wrote a simple program to generate as many such <em>man:woman<\/em> analogies as the computer could find. Sure enough, it spit out numerous sexist stereotypes like <em>man:woman<\/em> :: <em>computer programmer:homemaker<\/em>. This is especially concerning in light of the widespread use of this technology in AI applications everywhere. We designed an algorithm that reduced the gender bias, but <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1607.06520\">our NeurIPS paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> also tried to raise awareness in general, because we knew that even our new embeddings still had plenty of bias beyond gender.<\/p>\n<p>In the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1812.08769.pdf\">new project<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we show how to identify further groups beyond (binary) gender. 
In particular, we leverage the geometry of the word embeddings to find parallels between clusters of lower-case words and first names. We recover gender groups, and a number of other groups also emerge, including religious, ethnic, age, and racial groups. Table 1, below, shows 12 groups of names as clustered by the word embedding, along with certain demographic statistics.<\/p>\n<div id=\"attachment_567585\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-567585\" class=\"size-large wp-image-567585\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Table-1_cluster-of-first-names-1024x335.png\" alt=\"\" width=\"1024\" height=\"335\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Table-1_cluster-of-first-names-1024x335.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Table-1_cluster-of-first-names-300x98.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Table-1_cluster-of-first-names-768x251.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Table-1_cluster-of-first-names.png 1369w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-567585\" class=\"wp-caption-text\">Table 1. A clustering of first names in the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/code.google.com\/archive\/p\/word2vec\/\">Word2Vec embedding trained on Google News<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> into 12 groups. 
Also shown are demographic statistics of names computed (after the fact) from US government sources, including gender (percentage female at birth), birth year, and race (percentage Black, Hispanic, Asian Pacific Islander, and White), showing the groups differ strongly. Personally, although it\u2019s not evident in the percentages, I was surprised to see cluster F9 of clearly Israeli\/Jewish names, including several people in my family, simply because there aren\u2019t that many Jewish people out there. Even more concerning are racial groups that are strongly discriminated against in many contexts.<\/p><\/div>\n<p>Similar results were found for <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/fasttext.cc\/\">FastText<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/nlp.stanford.edu\/projects\/glove\/\">GloVe<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> embeddings. Next, the algorithmic challenge was in identifying statistically significant associations with these name groups. Unfortunately, the algorithm we developed found numerous offensive associations, as seen in Table 2. More specifically, we first found that all common lower-case words naturally clustered into categories of food-related, family-related, and occupation-related words. We then split each word category into the words that had statistically significant associations with one of the name groups. The results are shown in Table 3. 
For example, in the food category, I saw the stereotypical Israeli food words <em>kosher<\/em>, <em>hummus<\/em>, and <em>bagel<\/em> associated with the Israeli name cluster.<\/p>\n<div id=\"attachment_567588\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-567588\" class=\"size-large wp-image-567588\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/table-2_most-offensive-to-crowd-sourcing-1024x150.png\" alt=\"\" width=\"1024\" height=\"150\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/table-2_most-offensive-to-crowd-sourcing-1024x150.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/table-2_most-offensive-to-crowd-sourcing-300x44.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/table-2_most-offensive-to-crowd-sourcing-768x113.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/table-2_most-offensive-to-crowd-sourcing.png 1354w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-567588\" class=\"wp-caption-text\">Table 2. The most offensive (as rated by crowdsourcing) associations between groups of names and words that were generated from 3 popular publicly available word embeddings. The associated name groups for w2v can be found in Table 1.<\/p><\/div>\n<p>It was appalling how the algorithm produced offensive biases, one after another, against different groups. On the one hand, this is bad news and should cause concern for those haphazardly using word embeddings. On the other hand, the silver lining is that it might mean that computers are easier to probe for bias than humans, who may feel that they are unbiased. 
So, in future projects we can hopefully modify word embeddings to reduce biases and measure those that remain.<\/p>\n<div id=\"attachment_567591\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-567591\" class=\"size-large wp-image-567591\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Table-3_category-of-words-1024x905.png\" alt=\"\" width=\"1024\" height=\"905\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Table-3_category-of-words-1024x905.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Table-3_category-of-words-300x265.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Table-3_category-of-words-768x679.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/Table-3_category-of-words.png 1339w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-567591\" class=\"wp-caption-text\"><br \/>Table 3. Categories of words (rows; for example, the first row consists of food-related words and the last row consists of crime-related words) and their associations with the different groups of names from Table 1. Orange cells are biases that were confirmed to be consistent with human stereotypes in a U.S.-based crowdsourcing experiment.<\/p><\/div>\n<p>These biases could cause problems in a number of applications. For instance, a naive use of word embeddings could help one match resumes to jobs (just search \u201cresume matching word2vec\u201d on Google Scholar), for example by intelligently matching a \u201cprogrammer\u201d resume to a job posting for a \u201csoftware developer\u201d. However, they would also \u201chelp\u201d match people to jobs based on their names. 
And, even more insidiously, we find that even if you remove names from the resumes, the words in the columns of Table 3 are 99% consistent (in some precise sense) without the names. For example, the word <em>hostess<\/em> is closer to <em>volleyball<\/em> than to <em>cornerback<\/em>, while <em>cab driver<\/em> is closer to <em>cornerback<\/em> than to <em>volleyball<\/em> in the word2vec embedding. This means that people\u2019s job recommendations might exhibit racial and gender discrimination through proxies that they list on their resumes (for example, sports played in high school).<\/p>\n<p>To learn more, you\u2019re invited to read our paper, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1812.08769\">&#8220;What are the biases in my word embedding?&#8221;<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> by Nathaniel Swinger, Maria De-Arteaga, Neil Thomas Heffernan IV, Mark DM Leiserson, and Adam Tauman Kalai, featured in the Proceedings of the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.aies-conference.com\/\">AIES Conference<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, 2019.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One challenge with AI algorithmic fairness is that one usually has to know the potential group(s) that an algorithm might discriminate against in the first place. However, in joint work with Maria De-Arteaga, Nathaniel Swinger, Tom Heffernan, and Max Leiserson, we automatically enumerate groups of people that may be discriminated against alongside potential biases. 
We [&hellip;]<\/p>\n","protected":false},"author":37074,"featured_media":567552,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Adam Tauman Kalai","user_id":"30834"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[194466],"tags":[],"research-area":[13561],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-567519","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-algorithms","msr-research-area-algorithms","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[330695],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788.png\" class=\"img-object-cover\" alt=\"a person sitting at a table using a laptop\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788.png 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-768x432.png 768w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-1024x576.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-1066x600.png 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2019\/02\/FAT_Bias-In-Word-EmbeddingSite_02_2019_1400x788-343x193.png 343w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Adam Tauman Kalai","formattedDate":"February 13, 2019","formattedExcerpt":"One challenge with AI algorithmic fairness is that one usually has to know the potential group(s) that an algorithm might discriminate against in the first place. However, in joint work with Maria De-Arteaga, Nathaniel Swinger, Tom Heffernan, and Max Leiserson, we automatically enumerate groups 
of&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/567519","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/37074"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=567519"}],"version-history":[{"count":5,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/567519\/revisions"}],"predecessor-version":[{"id":567693,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/567519\/revisions\/567693"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/567552"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=567519"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=567519"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=567519"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=567519"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=567519"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=567519"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pag
es.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=567519"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=567519"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=567519"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=567519"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=567519"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}