Questions:
1. Relational Database
2. Metadata
3. Inner Join and Outer Join
4. Database normalization
5. Normal forms
6. Denormalization
7. Data Warehouse
8. ETL
9. Star Schema and Snowflake Schema
10. Fact table and Dimension table
Reference Answers:
1. Relational Database
A relational database is a database that conforms to the relational model, and refers to a database's data and schema (the database's structure of how those data are arranged). The term "relational database" is sometimes informally used to refer to a relational database management system, which is the software that is used to create and use a relational database.
关系数据库,是建立在关系模型基础上的数据库 , 借助集合代数等数学概念和方法来处理数据库中的数据。现实世界中的各种实体以及实体之间的各种联系均用关系模型来表示。关系模型是由埃德加 · 科德于 1970 年首先提出的,並配合 “ 科德十二定律 ” 。现如今虽然对此模型有一些批评意见,但它还是数据存储的传统标准。标准数据查询语言 SQL 就是一种基于关系数据库的语言 , 这种语言执行对关系数据库中数据的检索和操作。 关系模型由关系数据结构、关系操作集合、关系完整性约束三部分组成。
2. Metadata
Metadata (meta data, or sometimes metainformation) is "data about data", of any sort in any media. An item of metadata may describe an individual datum, or content item, or a collection of data including multiple content items and hierarchical levels, for example a database schema.
元数据, 即描述数据的数据( data about data ) , 主要描述数据的属性信息,用来支持之时存储位置、历史资料、资源寻找、文件记录等功能。元数据是一种电子式目录,为了达到编制目录的目的,必须描述并收藏数据的内容或特色,进而达成协助数据检索的目的。
3. Inner Join and Outer Join
An inner join requires each record in the two joined tables to have a matching record. An inner join essentially combines the records from two tables (A and B) based on a given join-predicate. The result of the join can be defined as the outcome of first taking the Cartesian product (or cross-join) of all records in the tables (combining every record in table A with every record in table B) - then return all records which satisfy the join predicate. Actual SQL implementations will normally use other approaches where possible, since computing the Cartesian product is not very efficient. This type of join occurs most commonly in applications, and represents the default join-type.
An outer join does not require each record in the two joined tables to have a matching record. The joined table retains each record—even if no other matching record exists. Outer joins subdivide further into left outer joins, right outer joins, and full outer joins, depending on which table(s) one retains the rows from (left, right, or both).
内连接查询操作列出与连接条件匹配的数据行,它使用比较运算符比较被连接列的列值。内连接分三种:
1) 等值连接:在连接条件中使用等于号 (=) 运算符比较被连接列的列值,其查询结果中列出被连接表中的所有列,包括其中的重复列。
2) 不等连接: 在连接条件使用除等于运算符以外的其它比较运算符比较被连接的列的列值。这些运算符包括 > 、 >= 、 <= 、 < 、 !> 、 !< 和 <> 。
3) 自然连接:在连接条件中使用等于 (=) 运算符比较被连接列的列值,但它使用选择列表指出查询结果集合中所包括的列,并删除连接表中的重复列。
内连接时,返回查询结果集合中的仅是符合查询条件 ( WHERE 搜索条件或 HAVING 条件 ) 和连接条件的行。而采用外连接时,它返回到查询结果集合中的不仅包含符合连接条件的行,而且还包括左表 ( 左外连接时 ) 、右表 ( 右外连接时 ) 或两个边接表 ( 全外连接 ) 中的所有数据行。
4. Database normalization
Database normalization, sometimes referred to as canonical synthesis, is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies. For example, when multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent when the data within the table is updated, leading to a loss of data integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind, because its structure reflects the basic assumptions for when multiple instances of the same information should be represented by a single instance only.
Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used in database applications involving many isolated transactions (e.g. an automated teller machine), while less normalized tables tend to be used in database applications that need to map complex relationships between data entities and data attributes (e.g. a reporting application, or a full-text search application).
Database theory describes a table's degree of normalization in terms of normal forms of successively higher degrees of strictness. A table in third normal form (3NF), for example, is consequently in second normal form (2NF) as well; but the reverse is not necessarily the case.
Although the normal forms are often defined informally in terms of the characteristics of tables, rigorous definitions of the normal forms are concerned with the characteristics of mathematical constructs known as relations. Whenever information is represented relationally, it is meaningful to consider the extent to which the representation is normalized.
数据库标准化,是数据库设计中应用到的一系列原理和技术,以减少数据库中数据冗余,增进数据的一致性。
5. Normal forms
The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less vulnerable it is to inconsistencies and anomalies. Each table has a "highest normal form" (HNF): by definition, a table always meets the requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF.
The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that all of its tables are in normal form n.
Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF design is first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works. A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the "higher" normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, because 3NF tables usually need no modification to meet the requirements of these higher normal forms.
Edgar F. Codd originally defined the first three normal forms (1NF, 2NF, and 3NF). These normal forms have been summarized as requiring that all non-key attributes be dependent on "the key, the whole key and nothing but the key". The fourth and fifth normal forms (4NF and 5NF) deal specifically with the representation of many-to-many and one-to-many relationships among attributes. Sixth normal form (6NF) incorporates considerations relevant to temporal databases.
范式即构造数据库必须遵循一定的规则。在关系数据库中,这种规则就是范式。范式是符合某一种级别的关系模式的集合。关系数据库中的关系必须满足一定的要求,即满足不同的范式。目前关系数据库有六种范式:第一范式( 1NF )、第二范式( 2NF )、第三范式( 3NF )、第四范式( 4NF )、第五范式( 5NF )和第六范式( 6NF )。满足最低要求的范式是第一范式( 1NF )。在第一范式的基础上进一步满足更多要求的称为第二范式( 2NF ),其余范式以次类推。一般说来,数据库只需满足第三范式( 3NF )就行了。
1 第一范式( 1NF )
在任何一个关系数据库中,第一范式( 1NF )是对关系模式的基本要求,不满足第一范式( 1NF )的数据库就不是关系数据库。
所谓第一范式( 1NF )是指数据库表的每一列都是不可分割的基本数据项,同一列中不能有多个值,即实体中的某个属性不能有多个值或者不能有重复的属性。如果出现重复的属性,就可能需要定义一个新的实体,新的实体由重复的属性构成,新实体与原实体之间为一对多关系。在第一范式( 1NF )中表的每一行只包含一个实例的信息。
2 第二范式( 2NF )
第二范式( 2NF )是在第一范式( 1NF )的基础上建立起来的,即满足第二范式( 2NF )必须先满足第一范式( 1NF )。第二范式( 2NF )要求数据库表中的每个实例或行必须可以被唯一地区分。
第二范式( 2NF )要求实体的属性完全依赖于主关键字。所谓完全依赖是指不能存在仅依赖主关键字一部分的属性,如果存在,那么这个属性和主关键字的这一部分应该分离出来形成一个新的实体,新实体与原实体之间是一对多的关系。为实现区分通常需要为表加上一个列,以存储各个实例的惟一标识。简而言之,第二范式就是非主属性非部分依赖于主关键字。
3 第三范式( 3NF )
满足第三范式( 3NF )必须先满足第二范式( 2NF )。简而言之,第三范式( 3NF )要求一个数据库表中不包含已在其它表中已包含的非主关键字信息。简而言之,第三范式就是属性不依赖于其它非主属性。
6. Denormalization
Databases intended for Online Transaction Processing (OLTP) are typically more normalized than databases intended for Online Analytical Processing (OLAP). OLTP Applications are characterized by a high volume of small transactions such as updating a sales record at a super market checkout counter. The expectation is that each transaction will leave the database in a consistent state. By contrast, databases intended for OLAP operations are primarily "read mostly" databases. OLAP applications tend to extract historical data that has accumulated over a long period of time. For such databases, redundant or "denormalized" data may facilitate Business Intelligence applications. Specifically, dimensional tables in a star schema often contain denormalized data. The denormalized or redundant data must be carefully controlled during ETL processing, and users should not be permitted to see the data until it is in a consistent state. The normalized alternative to the star schema is the snowflake schema. It has never been proven that this denormalization itself provides any increase in performance, or if the concurrent removal of data constraints is what increases the performance. In many cases, the need for denormalization has waned as computers and RDBMS software have become more powerful, but since data volumes have generally increased along with hardware and software performance, OLAP databases often still use denormalized schemas.
Denormalization is also used to improve performance on smaller computers as in computerized cash-registers and mobile devices, since these may use the data for look-up only (e.g. price lookups). Denormalization may also be used when no RDBMS exists for a platform (such as Palm), or no changes are to be made to the data and a swift response is crucial.
数据库的去规范化。 OLAP 主要是为了查询历史数据,所以为 OLAP 而设计的数据库中,一些冗余或者 ‘ 去规范 ’ 可能有利于为商业智能而开发的应用程序。尤其是在星型模式下的维度表中往往存在着非规范化的数据。
7. Data Warehouse
Data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
This classic definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.
In contrast to data warehouses are operational systems which perform day-to-day transaction processing.
"Datawarehousing"
The process of transforming data into information and making it available to the user in a timely enough manner to make a difference is known as data warehousing.
数据仓库是一个面向主题的、集成的、不可更新的、随时间不断变化的数据集合,它用于支持企业或组织的决策分析处理。
数据仓库,英文名称为 Data Warehouse ,可简写为 DW 。
数据仓库之父 Bill Inmon 在 1991 年出版的 “Building the Data Warehouse” 一书中所提出的定义被广泛接受 —— 数据仓库( Data Warehouse )是一个面向主题的( Subject Oriented )、集成的( Integrated )、相对稳定的( Non-Volatile )、反映历史变化( Time Variant )的数据集合,用于支持管理决策 (Decision Making Support) 。
面向主题:操作型数据库的数据组织面向事务处理任务,各个业务系统之间各自分离,而数据仓库中的数据是按照一定的主题域进行组织的。
集成的:数据仓库中的数据是在对原有分散的数据库数据抽取、清理的基础上经过系统加工、汇总和整理得到的,必须消除源数据中的不一致性,以保证数据仓库内的信息是关于整个企业的一致的全局信息。
相对稳定的:数据仓库的数据主要供企业决策分析之用,所涉及的数据操作主要是数据查询,一旦某个数据进入数据仓库以后,一般情况下将被长期保留,也就是数据仓库中一般有大量的查询操作,但修改和删除操作很少,通常只需要定期的加载、刷新。
反映历史变化:数据仓库中的数据通常包含历史信息,系统记录了企业从过去某一时点 ( 如开始应用数据仓库的时点 ) 到目前的各个阶段的信息,通过这些信息,可以对企业的发展历程和未来趋势做出定量分析和预测。
数据仓库是一个过程而不是一个项目。
数据仓库系统是一个信息提供平台,他从业务处理系统获得数据,主要以星型模型和雪花模型进行数据组织,并为用户提供各种手段从数据中获取信息和知识。
从功能结构化分,数据仓库系统至少应该包含数据获取( Data Acquisition )、数据存储( Data Storage )、数据访问( Data Access )三个关键部分。
8. ETL
Extract, Transform, and Load (ETL) is a process in data warehousing that involves
* extracting data from outside sources,
* transforming it to fit business needs (which can include quality levels), and ultimately
* loading it into the end target, i.e. the data warehouse.
ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database. ETL can also be used for the integration with legacy systems. Usually ETL implementations store an audit trail on positive and negative process runs. In almost all designs, this audit trail is not at the level of granularity which would allow to reproduce the ETL's result if the raw data were not available.
ETL , Extraction-Transformation-Loading 的缩写,中文名称为数据抽取、转换和加载。
ETL 负责将分布的、异构数据源中的数据如关系数据、平面数据文件等抽取到临时中间层后进行清洗、转换、集成,最后加载到数据仓库或数据集市中,成为联机分析处理、数据挖掘的基础。
ETL 是数据仓库中的非常重要的一环。它是承前启后的必要的一步。相对于关系数据库,数据仓库技术没有严格的数学理论基础,它更面向实际工程应用。所以从工程应用的角度来考虑,按着物理数据模型的要求加载数据并对数据进行一些系列处理,处理过程与经验直接相关,同时这部分的工作直接关系数据仓库中数据的质量,从而影响到联机分析处理和数据挖掘的结果的质量。
数据仓库是一个独立的数据环境,需要通过抽取过程将数据从联机事务处理环境、外部数据源和脱机的数据存储介质导入到数据仓库中;在技术上, ETL 主要涉及到关联、转换、增量、调度和监控等几个方面;数据仓库系统中数据不要求与联机事务处理系统中数据实时同步,所以 ETL 可以定时进行。但多个 ETL 的操作时间、顺序和成败对数据仓库中信息的有效性至关重要。
9. Star Schema and Snowflake Schema
The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema.
A snowflake schema is a logical arrangement of tables in a relational database such that the entity relationship diagram resembles a snowflake in shape. Closely related to the star schema, the snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. In the snowflake schema, however, dimensions are normalized into multiple related tables whereas the star schema's dimensions are denormalized with each dimension being represented by a single table. When the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and where child tables have multiple parent tables ("forks in the road"), a complex snowflake shape starts to emerge. The "snowflaking" effect only affects the dimension tables and not the fact tables.
可以将星型模式描述为一个简单的星型:中央表包含事实数据,多个表以中央表为中心呈放射状分布,它们通过数据库的主键和外键相互连接。
雪花型模式表示一种维度模型,该模型也是由一个中央事实表和一组成员维度表组成,这些维度表可进一步规范化为子维度表。
10. Fact table and Dimension table
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the centre of a star schema, surrounded by dimension tables.
Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, product and store. Other dimensions might be members of this fact table (such as location/region) but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region contains many stores).
In data warehousing, a dimension table is one of the set of companion tables to a fact table.
The fact table contains business facts or measures and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables.
The dimension tables contain attributes (or fields) used to constrain and group data when performing data warehousing queries.
Over time, the attributes of a given row in a dimension table may change. For example, the shipping address for a company may change. Kimball refers to this phenomenon as Slowly Changing Dimensions. Strategies for dealing with this kind of change are divided into three categories:
* Type One - Simply overwrite the old value(s).
* Type Two - Add a new row containing the new value(s), and distinguish between the rows using Tuple-versioning techniques.
* Type Three - Add a new attribute to the existing row.
每个数据仓库都包含一个或者多个事实数据表。事实数据表可能包含业务销售数据,如现金登记事务所产生的数据,事实数据表通常包含大量的行。事实数据表的主要特点是包含数字数据(事实),并且这些数字信息可以汇总,以提供有关单位作为历史的数据,每个事实数据表包含一个由多个部分组成的索引,该索引包含作为外键的相关性维度表的主键,而维度表包含事实记录的特性。事实数据表不应该包含描述性的信息,也不应该包含除数字度量字段及使事实与维度表中对应项的相关索引字段之外的任何数据。包含在事实数据表中的 “ 度量值 ” 有两中:一种是可以累计的度量值,另一种是非累计的度量值。最有用的度量值是可累计的度量值,其累计起来的数字是非常有意义的。用户可以通过累计度量值获得汇总信息,例如。可以汇总具体时间段内一组商店的特定商品的销售情况。非累计的度量值也可以用于事实数据表,单汇总结果一般是没有意义的,例如,在一座大厦的不同位置测量温度时,如果将大厦中所有不同位置的温度累加是没有意义的,但是求平均值是有意义的。一般来说,一个事实数据表都要和一个或多个维度表相关联,用户在利用事实数据表创建多维数据集时,可以使用一个或多个维度表。
维度表可以看作是用户来分析数据的窗口,维度表中包含事实数据表中事实记录的特性,有些特性提供描述性信息,有些特性指定如何汇总事实数据表数据,以便为分析者提供有用的信息,维度表包含帮助汇总数据的特性的层次结构。例如,包含产品信息的维度表通常包含将产品分为食品、饮料、非消费品等若干类的层次结构,这些产品中的每一类进一步多次细分,直到各产品达到最低级别。在维度表中,每个表都包含独立于其他维度表的事实特性,例如,客户维度表包含有关客户的数据。维度表中的列字段可以将信息分为不同层次的结构级。