Part1. Introduction to DataCleaner  介绍DataCleaner

  • |--What is data quality(DQ)  数据质量?
  • |--What is data profiling?   数据分析?
  • |--What is datastore?      数据存储?
  •   Composite datastore    综合性数据存储
  • |--What is data monitoring?  数据监控?
  • |--What is master data management(MDM)?  主数据管理?

What is data quality (DQ)?

Data Quality (DQ) is a concept and a business term covering the quality of the data used for a particular purpose. Often times the DQ term is applied to the quality of data used

数据质量即使一种概念又是一种用于说明特定目的包含质量数据的商业术语。很多时间DQ术语被应用到商业决策上,

in business decisions but it may also refer to the quality of data used in research, campaigns, processes and more.

但是也值得是质量数据被应用到研究、质量活动,流程等等。

Working with Data Quality typically varies a lot from project to project, just as the issues in the quality of data vary a lot. Examples of data quality issues include:

处理数据质量通常会随着项目和项目的不同而变化,就像数据质量的问题会有很大的不同。数据质量的问题主要有:

      1. Completeness of data  数据的完整性  
      2. Correctness of data   数据的正确性 
      3. Duplication of data    重复的数据
      4. Uniformedness/standardization of data  数据的标准性   

A less technical definition of high-quality data is, that data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran).

对高质量数据的一个不太技术性的定义是,数据具有高质量,“如果它们适合于其在运营、决策和规划方面的预期用途”(J. M. Juran)。

Data quality analysis (DQA) is the (human) process of examining the quality of data for a particular process or organization. The DQA includes both technical and non-technical

数据质量分析(DQA)是对特定过程或组织的数据质量进行检查的过程。数据质量分析包括的技术元素和非技术元素。

elements. For example, to do a good DQA you will probably need to talk to users, business people, partner organizations and maybe customers.

例如,要做一个好的DQA,您可能需要与用户、业务人员、伙伴组织和可能的客户交谈。

This is needed to asses what the goal of the DQA should be.

这是用来评估DQA目标的必要的。

From a technical viewpoint the main task in a DQA is the data profiling activity, which will help you discover and measure the current state of affairs in the data.

从技术角度来看,DQA中的主要任务是数据分析活动,它将帮助您发现和度量数据中的当前状态。

What is data profiling?

Data profiling is the activity of investigating a datastore to create a 'profile' of it. With a profile of your datastore you will be a lot better equipped to actually use and improve it.

数据分析是对数据存储进行调查以创建它的“概要”的活动。有了您的数据存储的概要,您将会有更好的去实际使用和改进它。

The way you do profiling often depends on whether you already have some ideas about the quality of the data or if you're not experienced with the datastore at hand. Either

您进行分析的方式通常取决于您是否已经对数据的质量有了一些想法,或者您是否对datastore没有经验。

way we recommend an explorative approach, because even though you think there are only a certain amount of issues you need to look for, it is our experience (and reasoning behind a lot of the features of DataCleaner) that it is just as important to check those items in the data that you think are correct!

无论哪种方式,我们都建议采用一种探索性的方法,因为即使您认为您需要查找的问题只有一定数量,但这是我们的经验(并且在数据收集者的许多特性后面进行推理),在您认为正确的数据中检查这些项同样重要!

Typically it's cheap to include a bit more data into your analysis and the results just might surprise you and save you time!

通常,在你的分析中包含更多的数据是没有价值的,结果可能会让你大吃一惊,节省你的时间!

DataCleaner comprises (amongst other aspects) a desktop application for doing data profiling on just about any kind of datastore.

DataCleaner包括(在其他方面)一个桌面应用程序,用于对任何类型的数据存储进行数据分析。

What is a datastore?

A datastore is the place where data is stored. Usually enterprise data lives in relational databases, but there are numerous exceptions to that rule.

数据存储是存储数据的地方。通常企业数据都存在于关系数据库中,但是有许多例外情况。

To comprehend different sources of data, such as databases, spreadsheets, XML files and even standard business applications, we employ the umbrella term datastore .

由不同来源的数据组成,例如数据库、电子表格、XML文件,甚至标准的业务应用程序,我们使用的是术语数据存储。

DataCleaner is capable of retrieving data from a very wide range of datastores. And furthermore, DataCleaner can update the data of most of these datastores as well.

DataCleaner能够从非常广泛的数据存储中检索数据。此外,DataCleaner还可以更新大多数这些数据存储的数据。

A datastore can be created in the UI or via the configuration file . You can create a datastore from any type of source such as: CSV, Excel, Oracle Database, MySQL, etc.

数据存储可以在UI中创建,也可以通过配置文件创建。您可以从任何类型的源(如:CSV、Excel、Oracle数据库、MySQL等)创建数据存储。

点击注册一个新的数据存储

Composite datastore

composite datastore contains multiple datastores . The main advantage of a composite datastore is that it allows you to analyze and process data from multiple sources in the same job.

复合数据存储包含多个数据存储。复合数据存储的主要优势在于,它允许您在同一作业中分析和处理来自多个源的数据。

What is data monitoring?

We've argued that data profiling is ideally an explorative activity. Data monitoring typically isn't! The measurements that you do when profiling often times needs to be continuously checked so that your improvements are enforced through time. This is what data monitoring is typically about.

Data monitoring solutions come in different shapes and sizes. You can set up your own bulk of scheduled jobs that run every night. You can build alerts around it that send you emails if a particular measure goes beyond its allowed thresholds, or in some cases you can attempt ruling out the issue entirely by applying First-Time-Right (FTR) principles that validate data at entry-time. eg. at data registration forms and more.

As of version 3, DataCleaner now also includes a monitoring web application, dubbed "DataCleaner monitor". The monitor is a server application that supports orchestrating and scheduling of jobs, as well as exposing metrics through web services and through interactive timelines and reports. It also supports the configuration and job-building process through wizards and management pages for all the components of the solution. As such, we like to say that the DataCleaner monitor provides a good foundation for the infrastructure needed in a Master Data Management hub.

What is master data management (MDM)?

Master data management (MDM) is a very broad term and is seen materialized in a variety of ways. For the scope of this document it serves more as a context of data quality than an activity that we actually target with DataCleaner per-se.

The overall goals of MDM is to manage the important data of an organization. By "master data" we refer to "a single version of the truth", ie. not the data of a particular system, but for example all the customer data or product data of a company. Usually this data is dispersed over multiple datastores, so an important part of MDM is the process of unifying the data into a single model.

Obviously another of the very important issues to handle in MDM is the quality of data. If you simply gather eg. "all customer data" from all systems in an organization, you will most likely see a lot of data quality issues. There will be a lot of duplicate entries, there will be variances in the way that customer data is filled, there will be different identifiers and even different levels of granularity for defining "what is a customer?". In the context of MDM, DataCleaner can serve as the engine to cleanse, transform and unify data from multiple datastores into the single view of the master data.

DataCleaner第一章的更多相关文章

  1. 《Django By Example》第一章 中文 翻译 (个人学习,渣翻)

    书籍出处:https://www.packtpub.com/web-development/django-example 原作者:Antonio Melé (译者注:本人目前在杭州某家互联网公司工作, ...

  2. MyBatis3.2从入门到精通第一章

    第一章一.引言mybatis是一个持久层框架,是apache下的顶级项目.mybatis托管到goolecode下,再后来托管到github下.(百度百科有解释)二.概述mybatis让程序将主要精力 ...

  3. Nova PhoneGap框架 第一章 前言

    Nova PhoneGap Framework诞生于2012年11月,从第一个版本的发布到现在,这个框架经历了多个项目的考验.一直以来我们也持续更新这个框架,使其不断完善.到现在,这个框架已比较稳定了 ...

  4. 第一章 MYSQL的架构和历史

    在读第一章的过程中,整理出来了一些重要的概念. 锁粒度  表锁(服务器实现,忽略存储引擎). 行锁(存储引擎实现,服务器没有实现). 事务的ACID概念 原子性(要么全部成功,要么全部回滚). 一致性 ...

  5. 第一章 Java多线程技能

    1.初步了解"进程"."线程"."多线程" 说到多线程,大多都会联系到"进程"和"线程".那么这两者 ...

  6. 【读书笔记】《编程珠玑》第一章之位向量&位图

    此书的叙述模式是借由一个具体问题来引出的一系列算法,数据结构等等方面的技巧性策略.共分三篇,基础,性能,应用.每篇涵盖数章,章内案例都非常切实棘手,解说也生动有趣. 自个呢也是头一次接触编程技巧类的书 ...

  7. 《JavaScript高级程序设计(第3版)》阅读总结记录第一章之JavaScript简介

    前言: 为什么会想到把<JavaScript 高级程序设计(第 3 版)>总结记录呢,之前写过一篇博客,研究的轮播效果,后来又去看了<JavaScript 高级程序设计(第3版)&g ...

  8. 《Entity Framework 6 Recipes》翻译系列 (1) -----第一章 开始使用实体框架之历史和框架简述

    微软的Entity Framework 受到越来越多人的关注和使用,Entity Framework7.0版本也即将发行.虽然已经开源,可遗憾的是,国内没有关于它的书籍,更不用说好书了,可能是因为EF ...

  9. 《Entity Framework 6 Recipes》翻译系列(2) -----第一章 开始使用实体框架之使用介绍

    Visual Studio 我们在Windows平台上开发应用程序使用的工具主要是Visual Studio.这个集成开发环境已经演化了很多年,从一个简单的C++编辑器和编译器到一个高度集成.支持软件 ...

随机推荐

  1. 将centos_yum源更换为阿里云(官方文档)

    http://mirrors.aliyun.com/help/centos?spm=5176.bbsr150321.0.0.d6ykiD 1.备份 mv /etc/yum.repos.d/CentOS ...

  2. TCP/IP协议全解析 三次握手与四次挥手[转]

    所谓三次握手(Three-Way Handshake)即建立TCP连接,就是指建立一个TCP连接时,需要客户端和服务端总共发送3个包以确认连接的建立.所谓四次挥手(Four-Way Wavehand) ...

  3. Android webView包装WebAPP

    前言 Android webView 兼容体验真的差到了极点!! 前一阵子,老板要讲 WebAPP 放到 Android 和 iOS 里面,而我因为以前做过安卓,所以这方面就由我来打包, 原理是很简单 ...

  4. 过16道练习学习Linq和Lambda(转)

    1. 查询Student表中的所有记录的Sname.Ssex和Class列. select sname,ssex,class from studentLinq: from s in Students ...

  5. Asp.net mvc 中View 的呈现(二)

    [toc] 上一节介绍了 Asp.net mvc 中除 ViewResult 外的所有的 ActionResult,这一节介绍 ViewResult. ViewResultBase ViewResul ...

  6. 【原创】Hibernate通过实体类自动建表时type=MyISAM的问题

    ι 版权声明:本文为博主原创文章,未经博主允许不得转载. 当使用的mysql数据库为5.5版本时,方言需要设置为 <property name="hibernate.dialect&q ...

  7. 使用Python读写csv文件的三种方法

    Python读写csv文件 觉得有用的话,欢迎一起讨论相互学习~Follow Me 前言 逗号分隔值(Comma-Separated Values,CSV,有时也称为字符分隔值,因为分隔字符也可以不是 ...

  8. IDEA Tomcat:Failed to initialize end point associated with ProtocolHandler

    发现Tomcat的日志中出现这样的错误,一般都是端口被占用了.在任务管理器中检查是否有其他的应用在使用该端口 Failed to initialize end point associated wit ...

  9. XBIM 基于 WexBIM 文件在 WebGL 浏览和加载

    目录 xBIM 应用与学习 (一) xBIM 应用与学习 (二) xBIM 基本的模型操作 xBIM 日志操作 XBIM 3D 墙壁案例 xBIM 格式之间转换 xBIM 使用Linq 来优化查询 x ...

  10. 洛谷 P3711 仓鼠的数学题 [伯努利数 fft]

    P3711 仓鼠的数学题 题意: \[ S_m(x) = \sum_{k=0}^x k^m, 0^0=1\quad 求 \sum_{m=0}^n S_m(x)a_m \] 的答案多项式\(\sum_{ ...