{"id":851687,"date":"2022-06-13T04:11:27","date_gmt":"2022-06-13T11:11:27","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?post_type=msr-project&#038;p=851687"},"modified":"2022-07-25T00:05:20","modified_gmt":"2022-07-25T07:05:20","slug":"kdd-2022-tutorial-on-large-scale-information-extraction-under-privacy-aware-constraints","status":"publish","type":"msr-project","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/kdd-2022-tutorial-on-large-scale-information-extraction-under-privacy-aware-constraints\/","title":{"rendered":"KDD 2022 Tutorial on Large-Scale Information Extraction under Privacy-Aware Constraints"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background bg-gray-200 has-background- card-background--full-bleed\">\n\t\t\t\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 align-self-center\">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 id=\"large-scale-information-extraction-under-privacy-aware-constraints-opens-in-new-tab\"><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/kdd.org\/kdd2022\/lectureTutorial.html\">Large-Scale Information Extraction under Privacy-Aware Constraints<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/h1>\n\n\n\n<p><\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<p>In this digital age, people spend a significant portion of their lives online and this has led to an explosion of personal 
data from users and their activities. Typically, this data is private and nobody except the user is allowed to look at it. This poses interesting and complex challenges from a scalable information extraction point of view: extracting information under privacy-aware constraints, where there is little data to learn from but highly accurate models are needed to run on large amounts of data across different users. Anonymization is typically used to convert private data into publicly accessible data. But this may not always be feasible and may require complex differential privacy guarantees in order to be safe from any potential negative consequences. Other techniques involve building models on a small amount of seen (eyes-on) data and a large amount of unseen (eyes-off) data. In this tutorial, we use emails as representative private data to explain the concepts of scalable IE under privacy-aware constraints.<\/p>\n\n\n\n<p>Around 270 billion emails are sent and received per day, and more than 60% of them are business-to-consumer (B2C) emails. At Microsoft, we have developed information extraction systems to extract relevant information from these emails for a large number of scenarios (e.g., flights, hotels, appointments), for thousands of sender domains (e.g., Amazon, Hilton, British Airways) and templates (HTML DOM structures), to power a number of AI applications (e.g., flight reminders, package tracking). These are the challenges that we need to overcome to develop information extraction systems for emails:<br><strong>Privacy<\/strong>: For legal and trust reasons, an email and its derivatives should be accessible only to the person for whom it is intended. 
Thus, we can\u2019t directly apply the web IE techniques used to extract information from webpages.<br><strong>Efficiency<\/strong>: As we need to process billions of emails every day, different for different users, extraction models need to be very efficient.<br><strong>Scalability<\/strong>: There are a large number of variations in the way information is presented in emails. For example, a flight itinerary is represented in different ways by different providers.<br><strong>Multilingual<\/strong>: Users are located across geographies, and hence the information extraction systems need to work across multiple languages.<\/p>\n\n\n\n<p>To extract information from B2C emails, one needs to classify the emails, cluster them into possible templates, build models to extract information from them, and monitor the models to maintain high precision and recall. How do IE techniques for private eyes-off data differ from those for eyes-on HTML data? How can labeled data be obtained in a privacy-preserving manner? What are the different techniques for generating semi-labeled data and learning from it? How can we build scalable extraction models across a number of sender domains that represent the information in different ways? How can these models be monitored with minimal human intervention? 
In this tutorial, we address all these questions from perspectives ranging from research to production.<\/p>\n\n\n\n\n\n<table>\n\t<tr>\n\t\t<td><figure class=\"alignright is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/10\/RajeevPp2.jpg\" alt=\"Rajeev Gupta\" class=\"wp-image-783187\" width=\"402\" height=\"522\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/10\/RajeevPp2.jpg 402w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/10\/RajeevPp2-231x300.jpg 231w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/10\/RajeevPp2-139x180.jpg 139w\" sizes=\"auto, (max-width: 402px) 100vw, 402px\" \/><\/figure><\/td>\n\t<\/tr>\n\t<tr>\n\t\t\t<td><strong>Rajeev Gupta<\/strong> is a Principal Applied Researcher at Microsoft Search Assistant & Intelligence (MSAI), India. He received his PhD in Computer Science from the Indian Institute of Technology (IIT) Mumbai (Bombay). He has more than 30 publications, in reputed conferences and journals such as TKDE, ICDE, VLDB, WWW, SIGMETRICS, CIKM, and KDD, and 20 patents in the areas of data management, information extraction, and distributed computing. For more than four years, he has been working on applying AI to information extraction and mining to enable intelligence in Microsoft Office.<\/td>\n\t<\/tr>\n\t<tr>\n\t\t<td><figure class=\"alignright is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/10\/RanganathSmall.jpg\" alt=\"Ranganath Kondapally\" class=\"wp-image-783187\" width=\"402\" height=\"522\" \/><\/figure><\/td>\n\t\t<\/tr>\n\t<tr>\n\t\t<td><strong>Ranganath Kondapally<\/strong> is a Principal Applied Researcher at Microsoft Search Assistant & Intelligence (MSAI), India. 
He received his PhD in Computer Science from Dartmouth College in the area of computational complexity and streaming algorithms. His areas of interest include information extraction, machine learning algorithms, and complexity theory. He has numerous publications and patents to his name in the areas of information extraction, streaming algorithms, and virtual reality. Currently, he is working on information extraction and inference problems on big data, powering delightful personal assistant experiences.<\/td>\n\t<\/tr>\n<\/table>\n\n\n\n\n\n<ol class=\"wp-block-list\"><li>Difference between Information Extraction from web pages and that from emails<ol><li>IE from Semi-structured data<\/li><li>IE from Unstructured data<\/li><\/ol><\/li><li>Handling privacy issues \u00a0<ol><li>Anonymization of emails\u00a0while keeping it <em>useful <\/em>for model\u00a0building\u00a0<\/li><li><em>Templatization <\/em>of emails <\/li><\/ol><\/li><li>Semi-supervised techniques to generate labeled data\u00a0\u00a0<ol><li>Concepts of semi-supervision, using Structural similarity<\/li><li>Active learning <\/li><li>Transfer learning<\/li><li>Knowledge distillation, Teacher-student architecture<\/li><li>Weak labeling, data programming <\/li><\/ol><\/li><li>Text classification across languages with limited data<ol><li>Generating\u00a0labeled\u00a0data for English\u00a0<\/li><li>Using English data to create classifiers for other languages (Spanish, Portuguese, etc.)<\/li><\/ol><\/li><li>Handling\u00a0Scalability\u00a0issues in model building\u00a0\u00a0<ol><li>Adapting web IE techniques by writing wrappers&#8211; their limitations,\u00a0conjunctive\u00a0and disjunctive template (DOM) based models, etc.<\/li><li>Scalability issues due to number of scenarios, sender domains, and templates <\/li><li>Rule induction:\u00a0Programming\u00a0by examples <\/li><li>Machine learning approaches:\u00a0LR,\u00a0CRF,\u00a0LSTM, etc. 
<\/li><li>Combining rule induction and machine learning to get high precision and recall with high\u00a0coverage\u00a0<ol><li>Ensemble approach: Automated clustering, ML models to\u00a0identify\u00a0individual fields, generating\u00a0<em>XPath<\/em>-based rules for each cluster.<\/li><li>Iterative approach: Use ML to create models which work well for seen templates and, to some extent, for\u00a0unseen templates; feed the data with probabilistic labels to the rule-induction module; use a semi-supervised approach to remove discrepancies between ML and rule induction outputs; iterate multiple times to improve the performance and coverage.\u00a0<\/li><\/ol><\/li><\/ol><\/li><li>Efficient monitoring to\u00a0maintain\u00a0high precision and\u00a0recall\u00a0<ol><li>Sampling to\u00a0identify\u00a0precision and recall gaps <\/li><li>Anomaly detection algorithms<\/li><\/ol><\/li><\/ol>\n\n\n","protected":false},"excerpt":{"rendered":"<p>In this digital age, people spend a significant portion of their lives online and this has led to an explosion of personal data from users and their activities. Typically, this data is private and nobody except the user is allowed to look at it. 
This poses interesting and complex challenges from scalable information extraction [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13555],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-851687","msr-project","type-msr-project","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-search-information-retrieval","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/851687","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":6,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/851687\/revisions"}],"predecessor-version":[{"id":952590,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/851687\/revisions\/952590"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=851687"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=851687"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?pos
t=851687"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=851687"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=851687"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}