{"id":294077,"date":"2016-09-19T00:38:57","date_gmt":"2016-09-19T07:38:57","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?post_type=msr-project&#038;p=294077"},"modified":"2025-03-04T10:47:26","modified_gmt":"2025-03-04T18:47:26","slug":"enterprise-dictionary","status":"publish","type":"msr-project","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/enterprise-dictionary\/","title":{"rendered":"Enterprise Dictionary"},"content":{"rendered":"<h2><a name=\"_Toc397646335\"><\/a>1.\u00a0\u00a0\u00a0 Project Introduction<\/h2>\n<p>&#8220;Everyday we are faced with a sea of acronyms, ever changing group structures, and fast-tracked projects.&#8221;<\/p>\n<p>Currently, collation and curation of corporate knowledge is a painstaking manual process. We seek to move these activities into the background so that the relationships between different people, project updates, and emerging milestones can be surfaced in an ambient light-weight way.<\/p>\n<p>This is our project: <strong>Enterprise Dictionary<\/strong>. It is a research project which aims to learn enterprise entities and their properties automatically from enterprise documents. 
It gracefully collects information about workplace collaborations and the ebb and flow of projects, so that institutional knowledge can be easily gathered and shared.<\/p>\n<p>The ultimate goal of our project is to build an <strong>Enterprise Satori<\/strong>, so that we can enable others to build knowledge-empowered applications in the enterprise.<\/p>\n<p>&nbsp;<\/p>\n<h2><a name=\"_Toc397646338\"><\/a>2.\u00a0\u00a0\u00a0 Framework<\/h2>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-294113 size-full\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2016\/09\/ED_Framework.png\" alt=\"Enterprise Dictionary Framework\" width=\"969\" height=\"781\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2016\/09\/ED_Framework.png 969w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2016\/09\/ED_Framework-300x242.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2016\/09\/ED_Framework-768x619.png 768w\" sizes=\"auto, (max-width: 969px) 100vw, 969px\" \/><\/p>\n<p>Our framework consists of an <strong>offline<\/strong> part and an <strong>online<\/strong> part.<\/p>\n<h2><a name=\"_Toc397646339\"><\/a>3.\u00a0\u00a0\u00a0 Offline Part<\/h2>\n<h3><a name=\"_Toc397646340\"><\/a>1)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Data Source<\/h3>\n<p>Currently, we can handle a variety of data sources, such as PDF, Word, PowerPoint, OneNote, meetings, email, SharePoint\/web pages, etc.<\/p>\n<h3><a name=\"_Toc397646341\"><\/a>2)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Data Parser & Cleaner<\/h3>\n<p>We parse all data sources into two formats: <strong>XML<\/strong> and <strong>TXT<\/strong>.<\/p>\n<p><strong>XML<\/strong> is semi-structured. It can be used for semi-structured knowledge learning, e.g. using the Kable technique to extract people-project relations.<\/p>\n<p><strong>TXT<\/strong> is unstructured. 
It can be used for acronym\/expansion mining, definition mining, project-concept (type) mining, etc.<\/p>\n<h3><a name=\"_Toc397646342\"><\/a>3)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 People-Project Mining<\/h3>\n<p>We apply Kable (semi-structured mining) technology to extract &lt;people, role, project&gt; facts from the documents.<\/p>\n<p>In practice, the technology is more complex than table extraction. We first identify the &#8220;structure&#8221;, then extract the &#8220;data&#8221;, classify the content by its &#8220;function\/property&#8221;, and finally build the associations. This technology has been successfully applied in Bing DU to automatically extract knowledge for Satori.<\/p>\n<h3><a name=\"_Toc397646343\"><\/a>4)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Definition Mining<\/h3>\n<p>We start from Hearst patterns, isA patterns, and some other patterns, e.g. &#8220;is a&#8221;, &#8220;is one of&#8221;, &#8220;means&#8221;, &#8220;stands for&#8221;, etc., to obtain candidate definition sentences for each topic. We manually label some ground truth for the definitions, and use RankSVM to train a linear classifier that decides whether a sentence is a definition. The features include: NLP features, pattern features, data source, author, acronym, structure, statistical, symbol, conceptualization, embedding, keywords, etc.<\/p>\n<p><strong><em>Our discriminative features<\/em><\/strong>:<\/p>\n<ul>\n<li>Word Embedding (Implicit embedding)<\/li>\n<li><a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/conceptualization\/\">Conceptualization<\/a> (Explicit embedding)<\/li>\n<\/ul>\n<h3><a name=\"_Toc397646344\"><\/a>5)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Acronym\/Expansion Mining<\/h3>\n<p>Some patterns and rules are introduced to mine acronyms, for example, a full name followed by a bracketed word in capital letters, e.g. Affinity Intent Map (AIM). 
Other patterns include a.k.a., f.k.a., etc.<\/p>\n<p>After we obtain a set of acronyms, we assume that occurrences of the same acronym within a document share the same semantic meaning. We therefore link all occurrences of an acronym in the current document to the same expansion.<\/p>\n<h3><a name=\"_Toc397646345\"><\/a>6)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Embedding Training (Ongoing)<\/h3>\n<p>We want to learn an implicit embedding for each term (word or phrase) in the enterprise, so that we can measure the similarity between two projects.<\/p>\n<p>We plan to leverage the AIM model for enterprise embedding training.<\/p>\n<h3><a name=\"_Toc397646346\"><\/a>7)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Project-Concept Mining (Ongoing)<\/h3>\n<p>We want to learn the concept (type) of each entity in the enterprise, e.g. that AIM is a <em>Deep Learning Project<\/em>, a <em>Word Embedding Project<\/em>, etc.<\/p>\n<p>This lets us learn an explicit embedding for each entity, and supports explicit topic reasoning over enterprise documents.<\/p>\n<h2><a name=\"_Toc397646347\"><\/a>4.\u00a0\u00a0\u00a0 Online Part<\/h2>\n<h3><a name=\"_Toc397646348\"><\/a>1)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Query Parser<\/h3>\n<p>Since we aim to support question-like queries, we need a parser to understand users\u2019 queries.<\/p>\n<p>First, we extract question patterns from emails in the &#8220;program manager information&#8221; discussion group, a place where Microsoft employees can ask questions of PMs.<\/p>\n<p>Once we have the question patterns, we also get a set of &#8220;pseudo labels&#8221; as well as sentences for training. We apply our parallelized sequential RNN (which is similar to a semi-CRF; we are working with Mei-Yuh\u2019s team to apply it to Cortana) to parse and understand the query, and then generate the corresponding SQL to retrieve the results.<\/p>\n<p>Most queries are 1-hop, but &#8220;steven yao&#8217;s PM&#8221; is a 2-hop query. 
We decompose it into &#8220;steven yao&#8217;s projects X&#8221; and &#8220;project X&#8217;s PM&#8221; to get the results.<\/p>\n<p>The output of this parser includes:<\/p>\n<ul>\n<li>Query Term<\/li>\n<li>Type: including \u201c<strong>(What)<\/strong> is\u201d, \u201c<strong>(Who)<\/strong> works on what\u201d, and \u201cWho works on <strong>(what)<\/strong>\u201d, where <em>(what)<\/em> and <em>(who)<\/em> are the answers that users look for.<\/li>\n<\/ul>\n<h3><a name=\"_Toc397646349\"><\/a>2)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Disambiguation Page<\/h3>\n<p>If the user\u2019s query is ambiguous, e.g. \u201cAIM\u201d, we return a page that lists the different senses of this term.<\/p>\n<p>The user can then pick the sense of the term he or she is looking for.<\/p>\n<p>If the user searches a query like \u201cWho works on QAS\u201d, we also try to distinguish the sense.<\/p>\n<p>If the user searches a person\u2019s name, e.g. \u201cYe-yi Wang\u201d, we try to return his projects directly.<\/p>\n<h3><a name=\"_Toc397646350\"><\/a>3)\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Detailed Page<\/h3>\n<p>When the user picks a sense of the term, we return a detailed page that includes three parts: definition, examples, and related people.<\/p>\n<h3><strong>Group<\/strong><\/h3>\n<p><img decoding=\"async\" src=\"https:\/\/concept.research.microsoft.com\/Images\/email.png\" width=\"15\" \/>\u00a0\u00a0<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/group\/data-mining-enterprise-intelligence\/\">Data Mining and Enterprise Intelligence Group<\/a>, MSRA<\/p>\n","protected":false},"excerpt":{"rendered":"<p>1.\u00a0\u00a0\u00a0 Project Introduction &#8220;Every day we are faced with a sea of acronyms, ever-changing group structures, and fast-tracked projects.&#8221; Currently, collation and curation of corporate knowledge is a painstaking manual process. 
We seek to move these activities into the background so that the relationships between different people, project updates, and emerging milestones can be surfaced [&hellip;]<\/p>\n","protected":false},"featured_media":294875,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13545,13555],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-294077","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-research-area-search-information-retrieval","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"2014-04-14","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Lei Ji","user_id":32636,"people_section":"Related 
people","alias":"leiji"}],"msr_research_lab":[199560],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/294077","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":2,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/294077\/revisions"}],"predecessor-version":[{"id":1133544,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/294077\/revisions\/1133544"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/294875"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=294077"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=294077"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=294077"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=294077"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=294077"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}