The data warehouse ETL toolkit: practical techniques for extracting, cleaning, conforming, and delivering data
Discusses how to use an ELT system, covering such topics as choosing an architecture, building a data cleaning subsystem, and finetuning the ELT process for optimum performance.
Gespeichert in:
Hauptverfasser: | , |
---|---|
Format: | Buch |
Sprache: | English |
Veröffentlicht: |
Indianapolis
Wiley
2004
|
Schlagworte: | |
Online-Zugang: | Table of contents Contributor biographical information Publisher description Inhaltsverzeichnis |
Zusammenfassung: | Discusses how to use an ELT system, covering such topics as choosing an architecture, building a data cleaning subsystem, and finetuning the ELT process for optimum performance. |
Beschreibung: | Includes index. |
Beschreibung: | XXXIV, 491 S. graph. Darst. |
ISBN: | 0764567578 |
Internformat
MARC
LEADER | 00000nam a2200000zc 4500 | ||
---|---|---|---|
001 | BV021638238 | ||
003 | DE-604 | ||
005 | 20210610 | ||
007 | t | ||
008 | 060630s2004 xxud||| |||| 00||| eng d | ||
010 | |a 2004016909 | ||
020 | |a 0764567578 |9 0-764-56757-8 | ||
035 | |a (OCoLC)56012371 | ||
035 | |a (DE-599)BVBBV021638238 | ||
040 | |a DE-604 |b ger |e aacr | ||
041 | 0 | |a eng | |
044 | |a xxu |c US | ||
049 | |a DE-91G |a DE-523 |a DE-473 |a DE-384 |a DE-188 |a DE-2070s |a DE-863 |a DE-861 |a DE-M158 |a DE-703 |a DE-1052 | ||
050 | 0 | |a QA76.9.D37 | |
082 | 0 | |a 005.74 |2 22 | |
084 | |a ST 530 |0 (DE-625)143679: |2 rvk | ||
084 | |a DAT 620f |2 stub | ||
100 | 1 | |a Kimball, Ralph |d 1944- |e Verfasser |0 (DE-588)1031963626 |4 aut | |
245 | 1 | 0 | |a The data warehouse ETL toolkit |b practical techniques for extracting, cleaning, conforming, and delivering data |c Ralph Kimball, Joe Caserta |
264 | 1 | |a Indianapolis |b Wiley |c 2004 | |
300 | |a XXXIV, 491 S. |b graph. Darst. | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
500 | |a Includes index. | ||
520 | 3 | |a Discusses how to use an ELT system, covering such topics as choosing an architecture, building a data cleaning subsystem, and finetuning the ELT process for optimum performance. | |
650 | 4 | |a Bases de données - Conception | |
650 | 4 | |a Entrepôts de données (Informatique) | |
650 | 4 | |a Data warehousing | |
650 | 4 | |a Database design | |
650 | 0 | 7 | |a Data Mining |0 (DE-588)4428654-5 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Data-Warehouse-Konzept |0 (DE-588)4406462-7 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Systementwurf |0 (DE-588)4261480-6 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Data-Warehouse-Konzept |0 (DE-588)4406462-7 |D s |
689 | 0 | |5 DE-604 | |
689 | 1 | 0 | |a Data-Warehouse-Konzept |0 (DE-588)4406462-7 |D s |
689 | 1 | 1 | |a Data Mining |0 (DE-588)4428654-5 |D s |
689 | 1 | 2 | |a Systementwurf |0 (DE-588)4261480-6 |D s |
689 | 1 | |5 DE-188 | |
700 | 1 | |a Caserta, Joe |d 1965- |e Verfasser |0 (DE-588)143982672 |4 aut | |
856 | 4 | |u http://www.loc.gov/catdir/toc/ecip0420/2004016909.html |3 Table of contents | |
856 | 4 | |u http://www.loc.gov/catdir/enhancements/fy0616/2004016909-b.html |3 Contributor biographical information | |
856 | 4 | |u http://www.loc.gov/catdir/enhancements/fy0616/2004016909-d.html |3 Publisher description | |
856 | 4 | 2 | |m Digitalisierung UB Bamberg |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=014853064&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
943 | 1 | |a oai:aleph.bib-bvb.de:BVB01-014853064 |
Datensatz im Suchindex
DE-BY-863_location | 1000 |
---|---|
DE-BY-FWS_call_number | 1000/ST 530 K49st |
DE-BY-FWS_katkey | 433712 |
DE-BY-FWS_media_number | 083101211224 |
_version_ | 1824553869662748674 |
adam_text |
Contents
Acknowledgments
xvii
About the Authors
xix
Introduction
xxi
Part I Requirements, Realities, and Architecture
1
Chapter l
Surrounding the Requirements
3
Requirements
4
Business Needs
4
Compliance Requirements
4
Data Profiling
5
Security Requirements
6
Data Integration
7
Data Latency
7
Archiving and Lineage
8
End User Delivery Interfaces
8
Available Skills
9
Legacy Licenses
9
Architecture
9
ETL Tool versus Hand Coding (Buy a Tool Suite or Roll
Your Own?)
10
The Back Room
-
Preparing me Data
16
The Front Room
—
Data Access
20
The Mission of the Data Warehouse
22
What the Data Warehouse Is
22
What the Data Warehouse Is Not
23
Industry Terms Not Used Consistently
25
vi!
viii Contents
Resolving Architectural Conflict: A Hybrid Approach
27
How the Data Warehouse Is Changing
27
The Mission of the ETL Team
28
Chapter
2
ETL Data Structures
29
To Stage or Not to Stage
29
Designing the Staging Area
31
Data Structures in the ETL System
35
Hat Files
35
XML Data Sets
38
Relational Tables
40
Independent DBMS Working Tables
41
Third Normal Form Entity/Relation Models
42
Nonrelational Data Sources
42
Dimensional Data Models: The Handoff from the Back
Room to the Front Room
45
Fact Tables
45
Dimension Tables
46
Atomic and Aggregate Fact Tables
47
Surrogate Key Mapping Tables
48
Planning and Design Standards
48
Impact Analysis
49
Metadata Capture
49
Naming Conventions
51
Auditing Data Transformation Steps
51
Summary
52
Part II Data Flow
53
Chapter
3
Extracting
55
Part
1:
The Logical Data Map
56
Designing Logical Before Physical
56
Inside the Logical Data Map
58
Components of the Logical Data Map
58
Using Tools for the Logical Data Map
62
Building the Logical Data Map
62
Data Discovery Phase
63
Data Content Analysis
71
Collecting Business Rules in the ETL Process
73
Integrating Heterogeneous Data Sources
73
Part
2:
The Challenge of Extracting from Disparate
Platforms
76
Connecting to Diverse Sources through ODBC
76
Mainframe Sources
78
Working with COBOL Copybooks
78
EBCDIC Character Set
79
Converting EBCDIC to ASCII
80
Contents ix
Transferring Data
between Platforms
80
Handling Mainframe Numeric Data
81
Using Pictures
81
Unpacking Packed Decimals
83
Working with Redefined Fields
84
Multiple OCCURS
85
Managing Multiple Mainframe Record Type Files
87
Handling Mainframe Variable Record Lengths
89
Flat Files
90
Processing Fixed Length Flat Files
91
Processing Delimited Flat Files
93
XML Sources
93
Character Sets
94
XML
Meta Data
94
Web Log Sources
97
W3C Common and Extended Formats
98
Name Value Pairs in Web Logs
100
ERP System Sources
102
Part
3:
Extracting Changed Data
105
Detecting Changes
106
Extraction Tips
109
Detecting Deleted or Overwritten Fact Records at the Source 111
Summary 111
Chapter
4
Cleaning and Conforming
113
Defining Data Quality
115
Assumptions
116
Part
1:
Design Objectives
117
Understand Your Key Constituencies
117
Competing Factors
119
Balancing Conflicting Priorities
120
Formulate a Policy
122
Part
2:
Cleaning Deliverables
124
Data Profiling Deliverable
125
Cleaning Deliverable
#1 :
Error Event Table
125
Cleaning Deliverable
#2:
Audit Dimension
128
Audit
Dimension Fine
Points
130
Part
3:
Screens and Their Measurements
131
Anomaly Detection Phase
131
Types of Enforcement
134
Column Property Enforcement
134
Structure Enforcement
135
Data and Value Rule Enforcement
135
Measurements Driving Screen Design
136
Overall Process Flow
136
The Show Must Go On—Usually
138
Screens
139
Contents
Known Table Row Counts
140
Column Nullity
140
Column Numeric and Date Ranges
141
Column Length Restriction
143
Column Explicit Valid Values
143
Column Explicit Invalid Values
144
Checking Table Row Count Reasonability
144
Checking Column Distribution Reasonability
146
General Data and Value Rule Reasonability
147
Part
4:
Conforming Deliverables
148
Conformed Dimensions
148
Designing the Conformed Dimensions
150
Taking the Pledge
150
Permissible Variations of Conformed Dimensions
150
Conformed Facts
151
The Fact Table Provider
152
The Dimension Manager: Publishing Conformed
Dimensions to Affected Fact Tables
152
Detailed Delivery Steps for Conformed Dimensions
153
Implementing the Conforming Modules
155
Matching Drives Deduplication
156
Surviving: Final Step of Conforming
158
Delivering
159
Summary
160
Chapter
5
Delivering Dimension Tables
161
The Basic Structure of a Dimension
162
The Grain of a Dimension
165
The Basic Load Plan for a Dimension
166
Flat Dimensions and Snowflaked Dimensions
167
Date and Time Dimensions
170
Big Dimensions
174
Small Dimensions
176
One Dimension or Two
176
Dimensional Roles
178
Dimensions as
Subdimensions
of Another Dimension
180
Degenerate Dimensions
182
Slowly Changing Dimensions
183
Type
1
Slowly Changing Dimension (Overwrite)
183
Type
2
Slowly Changing Dimension (Partitioning History)
185
Precise Time Stamping of a Type
2
Slowly Changing
Dimension
190
Type
3
Slowly Changing Dimension (Alternate Realities)
192
Hybrid Slowly Changing Dimensions
193
Late-Arriving Dimension Records and Correcting Bad Data
194
Multivalued Dimensions and Bridge Tables
196
Ragged Hierarchies and Bridge Tables
199
Technical Note: POPULATING HIERARCHY BRIDGE TABLES
201
Contents xi
Using Positional Attributes in a Dimension to Represent
Text Facts
204
Summary
207
Chapter
6
Delivering Fact Tables
209
The Basic Structure of a Fact Table
210
Guaranteeing Referential Integrity
212
Surrogate Key Pipeline
214
Using the Dimension Instead of a Lookup Table
217
Fundamental Grains
217
Transaction Grain Fact Tables
218
Periodic Snapshot Fact Tables
220
Accumulating Snapshot Fact Tables
222
Preparing for Loading Fact Tables
224
Managing Indexes
224
Managing Partitions
224
Outwitting the Rollback Log
226
Loading the Data
226
Incremental Loading
228
Inserting Facts
228
Updating and Correcting Facts
228
Negating Facts
229
Updating Facts
■ 230
Deleting Facts
230
Physically Deleting Facts
230
Logically Deleting Facts
232
Factless Fact Tables
232
Augmenting a Type
1
Fact Table with Type
2
History
234
Graceful Modifications
235
Multiple Units of Measure in a Fact Table
237
Collecting Revenue in Multiple Currencies
238
Late Arriving Facts
239
Aggregations
241
Design Requirement
#1 243
Design Requirement
#2 244
Design Requirement
#3 245
Design Requirement
#4 246
Administering Aggregations, Including Materialized
Views
246
Delivering Dimensional Data to
OLAP
Cubes
247
Cube Data Sources
248
Processing Dimensions
248
Changes in Dimension Data
249
Processing Facts
250
Integrating
OLAP
Processing into the ETL System
252
OLAP
Wrap-up
253
Summary
253
xii Contents
Part III Implementation and operations
Chapter
7
255
Chapter
8
Development
257
Current Marketplace ETL Tool Suite Offerings
258
Current Scripting Languages
260
Time Is of the Essence
260
Push Me or Pull Me
261
Ensuring Transfers with Sentinels
262
Sorting Data during Preload
263
Sorting on Mainframe Systems
264
Sorting on Unix and Windows Systems
266
Trimming the Fat (Filtering)
269
Extracting a Subset of the Source File Records on Mainframe
Systems
269
Extracting a Subset of the Source File Fields
270
Extracting a Subset of the Source File Records on Unix and
Windows Systems
271
Extracting a Subset of the Source File Fields
273
Creating Aggregated Extracts on Mainframe Systems
274
Creating Aggregated Extracts on UNIX and Windows
Systems
274
Using Database Bulk Loader Utilities to Speed Inserts
276
Preparing for Bulk Load
278
Managing Database Features to Improve Performance
280
The Order of Things
282
The Effect of Aggregates and Group
Bys
on Performance
286
Performance Impact of Using Scalar Functions
287
Avoiding Triggers
287
Overcoming ODBC the Bottleneck
288
Benefiting from Parallel Processing
288
Troubleshooting Performance Problems
292
Increasing ETL Throughput
294
Reducing Input/Output Contention
296
Eliminating Database Reads/Writes
296
Filtering as Soon as Possible
297
Partitioning and Parallelizing
297
Updating Aggregates Incrementally
298
Taking Only What You Need
299
Bulk Loading/Eliminating Logging
299
Dropping Databases Constraints and Indexes
299
Eliminating Network Traffic
300
Letting the ETL Engine Do the Work
300
Summary
300
Operations
301
Scheduling and Support
302
Reliability, Availability, Manageability Analysis for ETL
302
ETL Scheduling
101
303
Contents
xiii
Scheduling Tools
304
Load Dependencies
314
Metadata
314
Migrating to Production
315
Operational Support for the Data Warehouse
316
Bundling Version Releases
316
Supporting the ETL System in Production
319
Achieving Optimal ETL Performance
320
Estimating Load Time
321
Vulnerabilities of Long-Running ETL processes
324
Minimizing the Risk of Load Failures
330
Purging Historic Data
330
Monitoring the ETL System
331
Measuring ETL Specific Performance Indicators
331
Measuring Infrastructure Performance Indicators
332
Measuring Data Warehouse Usage to Help Manage ETL
Processes
337
Tuning ETL Processes
339
Explaining Database Overhead
340
ETL System Security
343
Securing the Development Environment
344
Securing the Production Environment
344
Short-Term Archiving and Recovery
345
Long-Term Archiving and Recovery
346
Media, Formats, Software, and Hardware
347
Obsolete Formats and Archaic Formats
347
Hard Copy, Standards, and Museums
348
Refreshing, Migrating, Emulating, and Encapsulating
349
Summary
350
Chapter
9
Metadata
351
Defining Metadata
352
Metadata—What Is It?
352
Source System Metadata
353
Data-Staging Metadata
354
DBMS Metadata
355
Front Room Metadata
356
Business Metadata
359
Business Definitions
360
Source System Information
361
Data Warehouse Data Dictionary
362
Logical Data Maps
363
Technical Metadata
363
System Inventory
364
Data Models
365
Data Definitions
365
Business Rules
366
ETL-Generated Metadata
367
xiv Contents
Chapter
10
Part IV
Chapter
11
ETL Job Metadata
368
Transformation Metadata
370
Batch Metadata
373
Data Quality Error Event Metadata
374
Process Execution Metadata
375
Metadata Standards and Practices
377
Establishing Rudimentary Standards
378
Naming Conventions
379
Impact Analysis
380
Summary
380
Responsibilities
383
Planning and Leadership
383
Having Dedicated Leadership
384
Planning Large, Building Small
385
Hiring Qualified Developers
387
Building Teams with Database Expertise
387
Don't Try to Save the World
388
Enforcing Standardization
388
Monitoring, Auditing, and Publishing Statistics
389
Maintaining Documentation
389
Providing and Utilizing Metadata
390
Keeping It Simple
390
Optimizing Throughput
390
Managing the Project
391
Responsibility of the ETL Team
391
Defining the Project
392
Planning the Project
393
Determining the Tool Set
393
Staffing Your Project
394
Project Plan Guidelines
401
Managing Scope
412
Summary
416
Real Time Streaming ETL Systems
419
Real-Time ETL Systems
421
Why
Real-Time ETL?
422
Defining
Real-Time ETL
424
Challenges and Opportunities of
Real-Tìme Data
Warehousing
424
Real-Time
Data Warehousing Review
425
Generation
1—
The Operational Data Store
425
Generation
2—
The Real-Time Partition
426
Recent CRM Trends
428
The Strategic Role of the Dimension Manager
429
Categorizing the Requirement
430
Contents xv
Data
Freshness and Historical Needs
430
Reporting Only or Integration, Too?
432
Just the Facts or Dimension Changes, Too?
432
Alerts, Continuous Polling, or Nonevents?
433
Data Integration or Application Integration?
434
Point-to-Point versus Hub-and-Spoke
434
Customer Data Cleanup Considerations
436
Real-Time ETL Approaches
437
Microbatch ETL
437
Enterprise Application Integration
441
Capture, Transform, and Flow
444
Enterprise Information Integration
446
The
Real-Time
Dimension Manager
447
Microbatch
Processing
452
Choosing an Approach
—
A Decision Guide
456
Summary
459
Chapter
12
Conclusions
461
Deepening the Definition of ETL
461
The Future of Data Warehousing and ETL in Particular
463
Ongoing Evolution of ETL Systems
464
Index
467 |
adam_txt |
Contents
Acknowledgments
xvii
About the Authors
xix
Introduction
xxi
Part I Requirements, Realities, and Architecture
1
Chapter l
Surrounding the Requirements
3
Requirements
4
Business Needs
4
Compliance Requirements
4
Data Profiling
5
Security Requirements
6
Data Integration
7
Data Latency
7
Archiving and Lineage
8
End User Delivery Interfaces
8
Available Skills
9
Legacy Licenses
9
Architecture
9
ETL Tool versus Hand Coding (Buy a Tool Suite or Roll
Your Own?)
10
The Back Room
-
Preparing me Data
16
The Front Room
—
Data Access
20
The Mission of the Data Warehouse
22
What the Data Warehouse Is
22
What the Data Warehouse Is Not
23
Industry Terms Not Used Consistently
25
vi!
viii Contents
Resolving Architectural Conflict: A Hybrid Approach
27
How the Data Warehouse Is Changing
27
The Mission of the ETL Team
28
Chapter
2
ETL Data Structures
29
To Stage or Not to Stage
29
Designing the Staging Area
31
Data Structures in the ETL System
35
Hat Files
35
XML Data Sets
38
Relational Tables
40
Independent DBMS Working Tables
41
Third Normal Form Entity/Relation Models
42
Nonrelational Data Sources
42
Dimensional Data Models: The Handoff from the Back
Room to the Front Room
45
Fact Tables
45
Dimension Tables
46
Atomic and Aggregate Fact Tables
47
Surrogate Key Mapping Tables
48
Planning and Design Standards
48
Impact Analysis
49
Metadata Capture
49
Naming Conventions
51
Auditing Data Transformation Steps
51
Summary
52
Part II Data Flow
53
Chapter
3
Extracting
55
Part
1:
The Logical Data Map
56
Designing Logical Before Physical
56
Inside the Logical Data Map
58
Components of the Logical Data Map
58
Using Tools for the Logical Data Map
62
Building the Logical Data Map
62
Data Discovery Phase
63
Data Content Analysis
71
Collecting Business Rules in the ETL Process
73
Integrating Heterogeneous Data Sources
73
Part
2:
The Challenge of Extracting from Disparate
Platforms
76
Connecting to Diverse Sources through ODBC
76
Mainframe Sources
78
Working with COBOL Copybooks
78
EBCDIC Character Set
79
Converting EBCDIC to ASCII
80
Contents ix
Transferring Data
between Platforms
80
Handling Mainframe Numeric Data
81
Using Pictures
81
Unpacking Packed Decimals
83
Working with Redefined Fields
84
Multiple OCCURS
85
Managing Multiple Mainframe Record Type Files
87
Handling Mainframe Variable Record Lengths
89
Flat Files
90
Processing Fixed Length Flat Files
91
Processing Delimited Flat Files
93
XML Sources
93
Character Sets
94
XML
Meta Data
94
Web Log Sources
97
W3C Common and Extended Formats
98
Name Value Pairs in Web Logs
100
ERP System Sources
102
Part
3:
Extracting Changed Data
105
Detecting Changes
106
Extraction Tips
109
Detecting Deleted or Overwritten Fact Records at the Source 111
Summary 111
Chapter
4
Cleaning and Conforming
113
Defining Data Quality
115
Assumptions
116
Part
1:
Design Objectives
117
Understand Your Key Constituencies
117
Competing Factors
119
Balancing Conflicting Priorities
120
Formulate a Policy
122
Part
2:
Cleaning Deliverables
124
Data Profiling Deliverable
125
Cleaning Deliverable
#1 :
Error Event Table
125
Cleaning Deliverable
#2:
Audit Dimension
128
Audit
Dimension Fine
Points
130
Part
3:
Screens and Their Measurements
131
Anomaly Detection Phase
131
Types of Enforcement
134
Column Property Enforcement
134
Structure Enforcement
135
Data and Value Rule Enforcement
135
Measurements Driving Screen Design
136
Overall Process Flow
136
The Show Must Go On—Usually
138
Screens
139
Contents
Known Table Row Counts
140
Column Nullity
140
Column Numeric and Date Ranges
141
Column Length Restriction
143
Column Explicit Valid Values
143
Column Explicit Invalid Values
144
Checking Table Row Count Reasonability
144
Checking Column Distribution Reasonability
146
General Data and Value Rule Reasonability
147
Part
4:
Conforming Deliverables
148
Conformed Dimensions
148
Designing the Conformed Dimensions
150
Taking the Pledge
150
Permissible Variations of Conformed Dimensions
150
Conformed Facts
151
The Fact Table Provider
152
The Dimension Manager: Publishing Conformed
Dimensions to Affected Fact Tables
152
Detailed Delivery Steps for Conformed Dimensions
153
Implementing the Conforming Modules
155
Matching Drives Deduplication
156
Surviving: Final Step of Conforming
158
Delivering
159
Summary
160
Chapter
5
Delivering Dimension Tables
161
The Basic Structure of a Dimension
162
The Grain of a Dimension
165
The Basic Load Plan for a Dimension
166
Flat Dimensions and Snowflaked Dimensions
167
Date and Time Dimensions
170
Big Dimensions
174
Small Dimensions
176
One Dimension or Two
176
Dimensional Roles
178
Dimensions as
Subdimensions
of Another Dimension
180
Degenerate Dimensions
182
Slowly Changing Dimensions
183
Type
1
Slowly Changing Dimension (Overwrite)
183
Type
2
Slowly Changing Dimension (Partitioning History)
185
Precise Time Stamping of a Type
2
Slowly Changing
Dimension
190
Type
3
Slowly Changing Dimension (Alternate Realities)
192
Hybrid Slowly Changing Dimensions
193
Late-Arriving Dimension Records and Correcting Bad Data
194
Multivalued Dimensions and Bridge Tables
196
Ragged Hierarchies and Bridge Tables
199
Technical Note: POPULATING HIERARCHY BRIDGE TABLES
201
Contents xi
Using Positional Attributes in a Dimension to Represent
Text Facts
204
Summary
207
Chapter
6
Delivering Fact Tables
209
The Basic Structure of a Fact Table
210
Guaranteeing Referential Integrity
212
Surrogate Key Pipeline
214
Using the Dimension Instead of a Lookup Table
217
Fundamental Grains
217
Transaction Grain Fact Tables
218
Periodic Snapshot Fact Tables
220
Accumulating Snapshot Fact Tables
222
Preparing for Loading Fact Tables
224
Managing Indexes
224
Managing Partitions
224
Outwitting the Rollback Log
226
Loading the Data
226
Incremental Loading
228
Inserting Facts
228
Updating and Correcting Facts
228
Negating Facts
229
Updating Facts
■ 230
Deleting Facts
230
Physically Deleting Facts
230
Logically Deleting Facts
232
Factless Fact Tables
232
Augmenting a Type
1
Fact Table with Type
2
History
234
Graceful Modifications
235
Multiple Units of Measure in a Fact Table
237
Collecting Revenue in Multiple Currencies
238
Late Arriving Facts
239
Aggregations
241
Design Requirement
#1 243
Design Requirement
#2 244
Design Requirement
#3 245
Design Requirement
#4 246
Administering Aggregations, Including Materialized
Views
246
Delivering Dimensional Data to
OLAP
Cubes
247
Cube Data Sources
248
Processing Dimensions
248
Changes in Dimension Data
249
Processing Facts
250
Integrating
OLAP
Processing into the ETL System
252
OLAP
Wrap-up
253
Summary
253
xii Contents
Part III Implementation and operations
Chapter
7
255
Chapter
8
Development
257
Current Marketplace ETL Tool Suite Offerings
258
Current Scripting Languages
260
Time Is of the Essence
260
Push Me or Pull Me
261
Ensuring Transfers with Sentinels
262
Sorting Data during Preload
263
Sorting on Mainframe Systems
264
Sorting on Unix and Windows Systems
266
Trimming the Fat (Filtering)
269
Extracting a Subset of the Source File Records on Mainframe
Systems
269
Extracting a Subset of the Source File Fields
270
Extracting a Subset of the Source File Records on Unix and
Windows Systems
271
Extracting a Subset of the Source File Fields
273
Creating Aggregated Extracts on Mainframe Systems
274
Creating Aggregated Extracts on UNIX and Windows
Systems
274
Using Database Bulk Loader Utilities to Speed Inserts
276
Preparing for Bulk Load
278
Managing Database Features to Improve Performance
280
The Order of Things
282
The Effect of Aggregates and Group
Bys
on Performance
286
Performance Impact of Using Scalar Functions
287
Avoiding Triggers
287
Overcoming ODBC the Bottleneck
288
Benefiting from Parallel Processing
288
Troubleshooting Performance Problems
292
Increasing ETL Throughput
294
Reducing Input/Output Contention
296
Eliminating Database Reads/Writes
296
Filtering as Soon as Possible
297
Partitioning and Parallelizing
297
Updating Aggregates Incrementally
298
Taking Only What You Need
299
Bulk Loading/Eliminating Logging
299
Dropping Databases Constraints and Indexes
299
Eliminating Network Traffic
300
Letting the ETL Engine Do the Work
300
Summary
300
Operations
301
Scheduling and Support
302
Reliability, Availability, Manageability Analysis for ETL
302
ETL Scheduling
101
303
Contents
xiii
Scheduling Tools
304
Load Dependencies
314
Metadata
314
Migrating to Production
315
Operational Support for the Data Warehouse
316
Bundling Version Releases
316
Supporting the ETL System in Production
319
Achieving Optimal ETL Performance
320
Estimating Load Time
321
Vulnerabilities of Long-Running ETL processes
324
Minimizing the Risk of Load Failures
330
Purging Historic Data
330
Monitoring the ETL System
331
Measuring ETL Specific Performance Indicators
331
Measuring Infrastructure Performance Indicators
332
Measuring Data Warehouse Usage to Help Manage ETL
Processes
337
Tuning ETL Processes
339
Explaining Database Overhead
340
ETL System Security
343
Securing the Development Environment
344
Securing the Production Environment
344
Short-Term Archiving and Recovery
345
Long-Term Archiving and Recovery
346
Media, Formats, Software, and Hardware
347
Obsolete Formats and Archaic Formats
347
Hard Copy, Standards, and Museums
348
Refreshing, Migrating, Emulating, and Encapsulating
349
Summary
350
Chapter
9
Metadata
351
Defining Metadata
352
Metadata—What Is It?
352
Source System Metadata
353
Data-Staging Metadata
354
DBMS Metadata
355
Front Room Metadata
356
Business Metadata
359
Business Definitions
360
Source System Information
361
Data Warehouse Data Dictionary
362
Logical Data Maps
363
Technical Metadata
363
System Inventory
364
Data Models
365
Data Definitions
365
Business Rules
366
ETL-Generated Metadata
367
xiv Contents
Chapter
10
Part IV
Chapter
11
ETL Job Metadata
368
Transformation Metadata
370
Batch Metadata
373
Data Quality Error Event Metadata
374
Process Execution Metadata
375
Metadata Standards and Practices
377
Establishing Rudimentary Standards
378
Naming Conventions
379
Impact Analysis
380
Summary
380
Responsibilities
383
Planning and Leadership
383
Having Dedicated Leadership
384
Planning Large, Building Small
385
Hiring Qualified Developers
387
Building Teams with Database Expertise
387
Don't Try to Save the World
388
Enforcing Standardization
388
Monitoring, Auditing, and Publishing Statistics
389
Maintaining Documentation
389
Providing and Utilizing Metadata
390
Keeping It Simple
390
Optimizing Throughput
390
Managing the Project
391
Responsibility of the ETL Team
391
Defining the Project
392
Planning the Project
393
Determining the Tool Set
393
Staffing Your Project
394
Project Plan Guidelines
401
Managing Scope
412
Summary
416
Real Time Streaming ETL Systems
419
Real-Time ETL Systems
421
Why
Real-Time ETL?
422
Defining
Real-Time ETL
424
Challenges and Opportunities of
Real-Tìme Data
Warehousing
424
Real-Time
Data Warehousing Review
425
Generation
1—
The Operational Data Store
425
Generation
2—
The Real-Time Partition
426
Recent CRM Trends
428
The Strategic Role of the Dimension Manager
429
Categorizing the Requirement
430
Contents xv
Data
Freshness and Historical Needs
430
Reporting Only or Integration, Too?
432
Just the Facts or Dimension Changes, Too?
432
Alerts, Continuous Polling, or Nonevents?
433
Data Integration or Application Integration?
434
Point-to-Point versus Hub-and-Spoke
434
Customer Data Cleanup Considerations
436
Real-Time ETL Approaches
437
Microbatch ETL
437
Enterprise Application Integration
441
Capture, Transform, and Flow
444
Enterprise Information Integration
446
The
Real-Time
Dimension Manager
447
Microbatch
Processing
452
Choosing an Approach
—
A Decision Guide
456
Summary
459
Chapter
12
Conclusions
461
Deepening the Definition of ETL
461
The Future of Data Warehousing and ETL in Particular
463
Ongoing Evolution of ETL Systems
464
Index
467 |
any_adam_object | 1 |
any_adam_object_boolean | 1 |
author | Kimball, Ralph 1944- Caserta, Joe 1965- |
author_GND | (DE-588)1031963626 (DE-588)143982672 |
author_facet | Kimball, Ralph 1944- Caserta, Joe 1965- |
author_role | aut aut |
author_sort | Kimball, Ralph 1944- |
author_variant | r k rk j c jc |
building | Verbundindex |
bvnumber | BV021638238 |
callnumber-first | Q - Science |
callnumber-label | QA76 |
callnumber-raw | QA76.9.D37 |
callnumber-search | QA76.9.D37 |
callnumber-sort | QA 276.9 D37 |
callnumber-subject | QA - Mathematics |
classification_rvk | ST 530 |
classification_tum | DAT 620f |
ctrlnum | (OCoLC)56012371 (DE-599)BVBBV021638238 |
dewey-full | 005.74 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 005 - Computer programming, programs, data, security |
dewey-raw | 005.74 |
dewey-search | 005.74 |
dewey-sort | 15.74 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
discipline_str_mv | Informatik |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>00000nam a2200000zc 4500</leader><controlfield tag="001">BV021638238</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20210610</controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">060630s2004 xxud||| |||| 00||| eng d</controlfield><datafield tag="010" ind1=" " ind2=" "><subfield code="a">2004016909</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">0764567578</subfield><subfield code="9">0-764-56757-8</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)56012371</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV021638238</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">aacr</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="044" ind1=" " ind2=" "><subfield code="a">xxu</subfield><subfield code="c">US</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-91G</subfield><subfield code="a">DE-523</subfield><subfield code="a">DE-473</subfield><subfield code="a">DE-384</subfield><subfield code="a">DE-188</subfield><subfield code="a">DE-2070s</subfield><subfield code="a">DE-863</subfield><subfield code="a">DE-861</subfield><subfield code="a">DE-M158</subfield><subfield code="a">DE-703</subfield><subfield code="a">DE-1052</subfield></datafield><datafield tag="050" ind1=" " ind2="0"><subfield code="a">QA76.9.D37</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">005.74</subfield><subfield code="2">22</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 530</subfield><subfield code="0">(DE-625)143679:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">DAT 620f</subfield><subfield code="2">stub</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Kimball, Ralph</subfield><subfield code="d">1944-</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1031963626</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">The data warehouse ETL toolkit</subfield><subfield code="b">practical techniques for extracting, cleaning, conforming, and delivering data</subfield><subfield code="c">Ralph Kimball, Joe Caserta</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Indianapolis</subfield><subfield code="b">Wiley</subfield><subfield code="c">2004</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XXXIV, 491 S.</subfield><subfield code="b">graph. Darst.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Includes index.</subfield></datafield><datafield tag="520" ind1="3" ind2=" "><subfield code="a">Discusses how to use an ELT system, covering such topics as choosing an architecture, building a data cleaning subsystem, and finetuning the ELT process for optimum performance.</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Bases de données - Conception</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Entrepôts de données (Informatique)</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data warehousing</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Database design</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Data-Warehouse-Konzept</subfield><subfield code="0">(DE-588)4406462-7</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Systementwurf</subfield><subfield code="0">(DE-588)4261480-6</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Data-Warehouse-Konzept</subfield><subfield code="0">(DE-588)4406462-7</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="689" ind1="1" ind2="0"><subfield code="a">Data-Warehouse-Konzept</subfield><subfield code="0">(DE-588)4406462-7</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="1" ind2="1"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="1" ind2="2"><subfield code="a">Systementwurf</subfield><subfield code="0">(DE-588)4261480-6</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="1" ind2=" "><subfield code="5">DE-188</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Caserta, Joe</subfield><subfield code="d">1965-</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)143982672</subfield><subfield code="4">aut</subfield></datafield><datafield tag="856" ind1="4" ind2=" "><subfield code="u">http://www.loc.gov/catdir/toc/ecip0420/2004016909.html</subfield><subfield code="3">Table of contents</subfield></datafield><datafield tag="856" ind1="4" ind2=" "><subfield code="u">http://www.loc.gov/catdir/enhancements/fy0616/2004016909-b.html</subfield><subfield code="3">Contributor biographical information</subfield></datafield><datafield tag="856" ind1="4" ind2=" "><subfield code="u">http://www.loc.gov/catdir/enhancements/fy0616/2004016909-d.html</subfield><subfield code="3">Publisher description</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Bamberg</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=014853064&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="943" ind1="1" ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-014853064</subfield></datafield></record></collection> |
id | DE-604.BV021638238 |
illustrated | Illustrated |
index_date | 2024-07-02T14:59:09Z |
indexdate | 2025-02-20T06:42:44Z |
institution | BVB |
isbn | 0764567578 |
language | English |
lccn | 2004016909 |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-014853064 |
oclc_num | 56012371 |
open_access_boolean | |
owner | DE-91G DE-BY-TUM DE-523 DE-473 DE-BY-UBG DE-384 DE-188 DE-2070s DE-863 DE-BY-FWS DE-861 DE-M158 DE-703 DE-1052 |
owner_facet | DE-91G DE-BY-TUM DE-523 DE-473 DE-BY-UBG DE-384 DE-188 DE-2070s DE-863 DE-BY-FWS DE-861 DE-M158 DE-703 DE-1052 |
physical | XXXIV, 491 S. graph. Darst. |
publishDate | 2004 |
publishDateSearch | 2004 |
publishDateSort | 2004 |
publisher | Wiley |
record_format | marc |
spellingShingle | Kimball, Ralph 1944- Caserta, Joe 1965- The data warehouse ETL toolkit practical techniques for extracting, cleaning, conforming, and delivering data Bases de données - Conception Entrepôts de données (Informatique) Data warehousing Database design Data Mining (DE-588)4428654-5 gnd Data-Warehouse-Konzept (DE-588)4406462-7 gnd Systementwurf (DE-588)4261480-6 gnd |
subject_GND | (DE-588)4428654-5 (DE-588)4406462-7 (DE-588)4261480-6 |
title | The data warehouse ETL toolkit practical techniques for extracting, cleaning, conforming, and delivering data |
title_auth | The data warehouse ETL toolkit practical techniques for extracting, cleaning, conforming, and delivering data |
title_exact_search | The data warehouse ETL toolkit practical techniques for extracting, cleaning, conforming, and delivering data |
title_exact_search_txtP | The data warehouse ETL toolkit practical techniques for extracting, cleaning, conforming, and delivering data |
title_full | The data warehouse ETL toolkit practical techniques for extracting, cleaning, conforming, and delivering data Ralph Kimball, Joe Caserta |
title_fullStr | The data warehouse ETL toolkit practical techniques for extracting, cleaning, conforming, and delivering data Ralph Kimball, Joe Caserta |
title_full_unstemmed | The data warehouse ETL toolkit practical techniques for extracting, cleaning, conforming, and delivering data Ralph Kimball, Joe Caserta |
title_short | The data warehouse ETL toolkit |
title_sort | the data warehouse etl toolkit practical techniques for extracting cleaning conforming and delivering data |
title_sub | practical techniques for extracting, cleaning, conforming, and delivering data |
topic | Bases de données - Conception Entrepôts de données (Informatique) Data warehousing Database design Data Mining (DE-588)4428654-5 gnd Data-Warehouse-Konzept (DE-588)4406462-7 gnd Systementwurf (DE-588)4261480-6 gnd |
topic_facet | Bases de données - Conception Entrepôts de données (Informatique) Data warehousing Database design Data Mining Data-Warehouse-Konzept Systementwurf |
url | http://www.loc.gov/catdir/toc/ecip0420/2004016909.html http://www.loc.gov/catdir/enhancements/fy0616/2004016909-b.html http://www.loc.gov/catdir/enhancements/fy0616/2004016909-d.html http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=014853064&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT kimballralph thedatawarehouseetltoolkitpracticaltechniquesforextractingcleaningconforminganddeliveringdata AT casertajoe thedatawarehouseetltoolkitpracticaltechniquesforextractingcleaningconforminganddeliveringdata |
Inhaltsverzeichnis
THWS Würzburg Zentralbibliothek Lesesaal
Signatur: |
1000 ST 530 K49st |
---|---|
Exemplar 1 | ausleihbar Verfügbar Bestellen |