Part1. Introduction to DataCleaner  介绍DataCleaner

  • |--What is data quality(DQ)  数据质量?
  • |--What is data profiling?   数据分析?
  • |--What is datastore?      数据存储?
  •   Composite datastore    综合性数据存储
  • |--What is data monitoring?  数据监控?
  • |--What is master data management(MDM)?  主数据管理?

What is data quality (DQ)?

Data Quality (DQ) is a concept and a business term covering the quality of the data used for a particular purpose. Often times the DQ term is applied to the quality of data used

数据质量即使一种概念又是一种用于说明特定目的包含质量数据的商业术语。很多时间DQ术语被应用到商业决策上,

in business decisions but it may also refer to the quality of data used in research, campaigns, processes and more.

但是也值得是质量数据被应用到研究、质量活动,流程等等。

Working with Data Quality typically varies a lot from project to project, just as the issues in the quality of data vary a lot. Examples of data quality issues include:

处理数据质量通常会随着项目和项目的不同而变化,就像数据质量的问题会有很大的不同。数据质量的问题主要有:

      1. Completeness of data  数据的完整性  
      2. Correctness of data   数据的正确性 
      3. Duplication of data    重复的数据
      4. Uniformedness/standardization of data  数据的标准性   

A less technical definition of high-quality data is, that data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran).

对高质量数据的一个不太技术性的定义是,数据具有高质量,“如果它们适合于其在运营、决策和规划方面的预期用途”(J. M. Juran)。

Data quality analysis (DQA) is the (human) process of examining the quality of data for a particular process or organization. The DQA includes both technical and non-technical

数据质量分析(DQA)是对特定过程或组织的数据质量进行检查的过程。数据质量分析包括的技术元素和非技术元素。

elements. For example, to do a good DQA you will probably need to talk to users, business people, partner organizations and maybe customers.

例如,要做一个好的DQA,您可能需要与用户、业务人员、伙伴组织和可能的客户交谈。

This is needed to asses what the goal of the DQA should be.

这是用来评估DQA目标的必要的。

From a technical viewpoint the main task in a DQA is the data profiling activity, which will help you discover and measure the current state of affairs in the data.

从技术角度来看,DQA中的主要任务是数据分析活动,它将帮助您发现和度量数据中的当前状态。

What is data profiling?

Data profiling is the activity of investigating a datastore to create a 'profile' of it. With a profile of your datastore you will be a lot better equipped to actually use and improve it.

数据分析是对数据存储进行调查以创建它的“概要”的活动。有了您的数据存储的概要,您将会有更好的去实际使用和改进它。

The way you do profiling often depends on whether you already have some ideas about the quality of the data or if you're not experienced with the datastore at hand. Either

您进行分析的方式通常取决于您是否已经对数据的质量有了一些想法,或者您是否对datastore没有经验。

way we recommend an explorative approach, because even though you think there are only a certain amount of issues you need to look for, it is our experience (and reasoning behind a lot of the features of DataCleaner) that it is just as important to check those items in the data that you think are correct!

无论哪种方式,我们都建议采用一种探索性的方法,因为即使您认为您需要查找的问题只有一定数量,但这是我们的经验(并且在数据收集者的许多特性后面进行推理),在您认为正确的数据中检查这些项同样重要!

Typically it's cheap to include a bit more data into your analysis and the results just might surprise you and save you time!

通常,在你的分析中包含更多的数据是没有价值的,结果可能会让你大吃一惊,节省你的时间!

DataCleaner comprises (amongst other aspects) a desktop application for doing data profiling on just about any kind of datastore.

DataCleaner包括(在其他方面)一个桌面应用程序,用于对任何类型的数据存储进行数据分析。

What is a datastore?

A datastore is the place where data is stored. Usually enterprise data lives in relational databases, but there are numerous exceptions to that rule.

数据存储是存储数据的地方。通常企业数据都存在于关系数据库中,但是有许多例外情况。

To comprehend different sources of data, such as databases, spreadsheets, XML files and even standard business applications, we employ the umbrella term datastore .

由不同来源的数据组成,例如数据库、电子表格、XML文件,甚至标准的业务应用程序,我们使用的是术语数据存储。

DataCleaner is capable of retrieving data from a very wide range of datastores. And furthermore, DataCleaner can update the data of most of these datastores as well.

DataCleaner能够从非常广泛的数据存储中检索数据。此外,DataCleaner还可以更新大多数这些数据存储的数据。

A datastore can be created in the UI or via the configuration file . You can create a datastore from any type of source such as: CSV, Excel, Oracle Database, MySQL, etc.

数据存储可以在UI中创建,也可以通过配置文件创建。您可以从任何类型的源(如:CSV、Excel、Oracle数据库、MySQL等)创建数据存储。

点击注册一个新的数据存储

Composite datastore

composite datastore contains multiple datastores . The main advantage of a composite datastore is that it allows you to analyze and process data from multiple sources in the same job.

复合数据存储包含多个数据存储。复合数据存储的主要优势在于,它允许您在同一作业中分析和处理来自多个源的数据。

What is data monitoring?

We've argued that data profiling is ideally an explorative activity. Data monitoring typically isn't! The measurements that you do when profiling often times needs to be continuously checked so that your improvements are enforced through time. This is what data monitoring is typically about.

Data monitoring solutions come in different shapes and sizes. You can set up your own bulk of scheduled jobs that run every night. You can build alerts around it that send you emails if a particular measure goes beyond its allowed thresholds, or in some cases you can attempt ruling out the issue entirely by applying First-Time-Right (FTR) principles that validate data at entry-time. eg. at data registration forms and more.

As of version 3, DataCleaner now also includes a monitoring web application, dubbed "DataCleaner monitor". The monitor is a server application that supports orchestrating and scheduling of jobs, as well as exposing metrics through web services and through interactive timelines and reports. It also supports the configuration and job-building process through wizards and management pages for all the components of the solution. As such, we like to say that the DataCleaner monitor provides a good foundation for the infrastructure needed in a Master Data Management hub.

What is master data management (MDM)?

Master data management (MDM) is a very broad term and is seen materialized in a variety of ways. For the scope of this document it serves more as a context of data quality than an activity that we actually target with DataCleaner per-se.

The overall goals of MDM is to manage the important data of an organization. By "master data" we refer to "a single version of the truth", ie. not the data of a particular system, but for example all the customer data or product data of a company. Usually this data is dispersed over multiple datastores, so an important part of MDM is the process of unifying the data into a single model.

Obviously another of the very important issues to handle in MDM is the quality of data. If you simply gather eg. "all customer data" from all systems in an organization, you will most likely see a lot of data quality issues. There will be a lot of duplicate entries, there will be variances in the way that customer data is filled, there will be different identifiers and even different levels of granularity for defining "what is a customer?". In the context of MDM, DataCleaner can serve as the engine to cleanse, transform and unify data from multiple datastores into the single view of the master data.

DataCleaner第一章的更多相关文章

  1. 《Django By Example》第一章 中文 翻译 (个人学习,渣翻)

    书籍出处:https://www.packtpub.com/web-development/django-example 原作者:Antonio Melé (译者注:本人目前在杭州某家互联网公司工作, ...

  2. MyBatis3.2从入门到精通第一章

    第一章一.引言mybatis是一个持久层框架,是apache下的顶级项目.mybatis托管到goolecode下,再后来托管到github下.(百度百科有解释)二.概述mybatis让程序将主要精力 ...

  3. Nova PhoneGap框架 第一章 前言

    Nova PhoneGap Framework诞生于2012年11月,从第一个版本的发布到现在,这个框架经历了多个项目的考验.一直以来我们也持续更新这个框架,使其不断完善.到现在,这个框架已比较稳定了 ...

  4. 第一章 MYSQL的架构和历史

    在读第一章的过程中,整理出来了一些重要的概念. 锁粒度  表锁(服务器实现,忽略存储引擎). 行锁(存储引擎实现,服务器没有实现). 事务的ACID概念 原子性(要么全部成功,要么全部回滚). 一致性 ...

  5. 第一章 Java多线程技能

    1.初步了解"进程"."线程"."多线程" 说到多线程,大多都会联系到"进程"和"线程".那么这两者 ...

  6. 【读书笔记】《编程珠玑》第一章之位向量&位图

    此书的叙述模式是借由一个具体问题来引出的一系列算法,数据结构等等方面的技巧性策略.共分三篇,基础,性能,应用.每篇涵盖数章,章内案例都非常切实棘手,解说也生动有趣. 自个呢也是头一次接触编程技巧类的书 ...

  7. 《JavaScript高级程序设计(第3版)》阅读总结记录第一章之JavaScript简介

    前言: 为什么会想到把<JavaScript 高级程序设计(第 3 版)>总结记录呢,之前写过一篇博客,研究的轮播效果,后来又去看了<JavaScript 高级程序设计(第3版)&g ...

  8. 《Entity Framework 6 Recipes》翻译系列 (1) -----第一章 开始使用实体框架之历史和框架简述

    微软的Entity Framework 受到越来越多人的关注和使用,Entity Framework7.0版本也即将发行.虽然已经开源,可遗憾的是,国内没有关于它的书籍,更不用说好书了,可能是因为EF ...

  9. 《Entity Framework 6 Recipes》翻译系列(2) -----第一章 开始使用实体框架之使用介绍

    Visual Studio 我们在Windows平台上开发应用程序使用的工具主要是Visual Studio.这个集成开发环境已经演化了很多年,从一个简单的C++编辑器和编译器到一个高度集成.支持软件 ...

随机推荐

  1. DRBD的主备安装配置

    drbd软件包链接:https://pan.baidu.com/s/1eUcXVyU 密码:00ul 1.使用的资源:1.1 系统centos6.9 mini1.2 两台节点主机node1.node2 ...

  2. Electron 打包Mac安装包代码签名问题解决方案Could not get code signature for running application

    最近一直在做electron应用的打包,集成mac版本的自动更新时出现了问题. Error: Could not get code signature for running application ...

  3. 网页转图片--- html2canvas截图

    最近有个做在线名片(可保存图片至本地)的任务,特意研究了一下图片生成,也踩了几个坑.特此总结一下,顺便分享一下demo: 链接:https://pan.baidu.com/s/1o98UBJO 密码: ...

  4. spring boot + vue + element-ui全栈开发入门——windows开发环境

     一.node.js开发环境 windows系统,去网站https://nodejs.org/en/download/,下载对应的安装程序,并安装Windows Installer (.msi) 接下 ...

  5. java对象表示方式--XStream

    对象表示有各种各样的方式,序列化只是其中的一种而已.表示一个对象的目的无非就是为了对象<---->IO之间相互认识,至于怎么认识,那就有很多选择了.除了之前讲过的序列化,还可以选择将数据J ...

  6. 代理(Proxy)模式

    代理模式的类图如下所示: 客户端想调用的是RealSubject,由于某种考虑或原因,只能直接访问到ProxySubject,再由ProxySubject去调用RealSubject,这就完成了一次代 ...

  7. 洛谷 [P3384] 树链剖分 模版

    支持各种数据结构上树,注意取膜. #include <iostream> #include <cstring> #include <algorithm> #incl ...

  8. NOIP 2017 Day 0. 游记

    刚从曲师大试机回来... 不巧,我抽到了和去年一样的考场,还是那么难用的XP,还是那么难用的键盘. 似乎在考场上有一股奇怪的力量,我本来在自己电脑上打板子打的没那么快,但是试机的那段时间..说出来你们 ...

  9. BZOJ 3456: 城市规划 [多项式求逆元 DP]

    题意: 求出n个点的简单(无重边无自环)无向连通图数目.方案数mod 1004535809(479 * 2 ^ 21 + 1)即可. n<=130000 DP求方案 g(n) n个点所有图的方案 ...

  10. 利用Effmpeg 提取视频中的音频(mp3)

    在B站看到一个up发的病名为爱的钢琴曲,感觉很好听,然后当然是要加入歌单啊.然而不知道怎么转换成mp3,找来找去找到了EFFmpeg 这篇只是达到了我简单的需求,以后可能会有EFFmpeg更详细的使用 ...