Data Profiling Process

After you create a data profile, you can run the profile session and view the results in a report.

The following steps describe the data profiling process:

  • Create a data profile.
  • Run the profile session.
  • View the reports.

1. Create a data profile. Use the Profile Wizard in the Designer to create a data profile based on a source definition and a set of functions. The Profile Wizard generates a mapping and a session based on criteria that you provide.

2. Run the profile session. You can choose to run the profile session when you finish the Profile Wizard, or you can run it from the Profile Manager. The Integration Service runs the session and loads the profile results to the Data Profiling warehouse.

3. View the reports. View the Data Profiling report associated with the profile session. Based on the type of profile report, you can view reports from the Profile Manager or from Data Analyzer.
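As a conceptual illustration only (PowerCenter generates the actual profile mapping and session internally), the kind of column statistics a profile function collects during a session might be sketched in Python like this; the function and field names are hypothetical:

```python
# Conceptual sketch only: PowerCenter builds profile mappings internally.
# This illustrates the kind of content statistics a profile function
# collects for a single source column.
def profile_column(values):
    """Collect basic content statistics for one source column."""
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

stats = profile_column([10, 25, None, 25, 40])
print(stats)
```

Results like these are what the Integration Service loads into the Data Profiling warehouse, one row set per function per session run.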



Data Profiling Connectivity

PowerCenter Data Profiling uses the following types of connectivity:

  • TCP/IP.
  • Native.
  • ODBC.
  • JDBC.

TCP/IP. The PowerCenter Client and the Integration Service use TCP/IP to communicate with the Repository Service.

Native. The Integration Service uses native database connectivity to connect to the Data Profiling warehouse when it loads target data from the profiling sessions.

ODBC. The PowerCenter Client uses ODBC to connect to the Data Profiling warehouse when you run data profiling reports from the Profile Manager.

JDBC. Data Analyzer uses JDBC to connect to the Data Profiling warehouse when you run data profiling reports.


Data Profiling Components

To understand data profiling, you need to be familiar with the following components:

  • PowerCenter Client.
  • PowerCenter Data Profile.
  • Data Profiling warehouse.
  • Data Profiling reports.

PowerCenter Client. Use the PowerCenter Client to create and manage data profiles.

PowerCenter Data Profile. Metadata that you generate in the PowerCenter Client that defines the types of statistics you want to collect for a source. A data profile consists of a source definition, a profile mapping, and a profile session.

Data Profiling warehouse. The Data Profiling warehouse stores results from profile sessions and reports that you run to view the results.

Data Profiling reports. View data and metadata in Data Profiling reports.

PowerCenter Client


Use the following PowerCenter Client tools to create and manage data profiles:

  • Designer.
  • Profile Manager.

Designer. Create data profiles from the Source Analyzer or the Mapplet Designer. When you create a data profile, the Designer generates a profile mapping based on the profile functions. The PowerCenter repository stores the profile mappings and metadata. If the repository is versioned, profile mappings are versioned in the same way other PowerCenter mappings are versioned.

Profile Manager. A tool in the PowerCenter Designer that you use to manage data profiles. You can edit and regenerate profiles, run profile sessions, and view profile results.


PowerCenter Data Profile


A data profile contains the source definitions, the functions and function parameters, and the profile session run parameters. To create a data profile, you run the Profile Wizard from the PowerCenter Designer. When you create a data profile, you create the following repository objects:

  • Profile.
  • Profile mapping.
  • Profile session.

Profile. A profile is a repository object that represents all the metadata configured in the wizard. You create the profile based on a mapplet or source definition and a set of functions.

Profile mapping. When you create a data profile, the Profile Wizard generates a profile mapping. Select functions in the wizard to help determine the content, structure, and quality of the profile source. You can use predefined or custom functions. The Profile Wizard creates transformations and adds targets based on the functions that you supply. You can view the profile mapping in the Mapping Designer.

Profile session. After the Profile Wizard generates a profile mapping, you provide basic session information such as the Integration Service name and connection information for the source and the Data Profiling warehouse. The Profile Wizard creates a profile session and a profile workflow. You can choose to run the profile session when the wizard completes, or you can run it later. When you run a profile session, the Integration Service writes profile results to the Data Profiling warehouse.

While profiles are not versioned, the profile mappings and profile sessions are versioned objects.
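The relationship between the repository objects listed above might be modeled conceptually like this; the class and field names are invented for illustration, and the real objects exist only in the PowerCenter repository, not as a Python API:

```python
from dataclasses import dataclass, field

# Conceptual model only: these names mirror the repository objects the
# Profile Wizard creates (profile, profile mapping, profile session).
@dataclass
class DataProfile:
    source_definition: str                          # source or mapplet the profile is based on
    functions: list = field(default_factory=list)   # selected profile functions
    mapping_name: str = ""                          # generated profile mapping
    session_name: str = ""                          # generated profile session

profile = DataProfile(
    "CUSTOMERS",
    ["null_count", "distinct_count"],
    "m_profile_CUSTOMERS",
    "s_profile_CUSTOMERS",
)
print(profile.mapping_name)
```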

Data Profiling Warehouse


The Data Profiling warehouse is a set of tables that stores the results from profile sessions. It also contains reports that you run to view the profile session results. You can create a Data Profiling warehouse on any relational database that PowerCenter supports as a source or target database. Create a Data Profiling warehouse for each PowerCenter repository you want to store data profiles in.


Data Profiling Reports


You can view the results of each function configured in the data profile. Based on the type of metadata you want to view, you can view reports from the following tools:

  • Profile Manager.
  • Data Analyzer.

Profile Manager. PowerCenter Data Profiling reports provide information about the latest session run. View them from the Profile Manager.

Data Analyzer. Data Analyzer Data Profiling reports provide composite, metadata, and summary reports. View them from the Data Profiling dashboard in Data Analyzer. You can also customize the reports in Data Analyzer.


What is Data Profiling

Data profiling is a technique used to analyze the content, quality, and structure of source data. Use PowerCenter Data Profiling to detect patterns and exceptions in source data during mapping development and during production. Use data profiling to perform the following types of analysis:

  • Make initial assessments.
  • Validate business rules.
  • Verify assumptions.
  • Verify report validity.

Make initial assessments. You can make initial assessments about data patterns and exceptions in source data during mapping development. As a result, you can design mappings and workflows based on actual data, rather than make theoretical assumptions about sources.

Validate business rules. You can validate documented business rules about the source data. For example, if you have a business rule requiring columns in a source table to contain U.S. ZIP codes, you can profile the source data to verify that the rows in this table contain the proper values.

Verify assumptions. You can verify that the initial assumptions you made about source data during project development are still valid. For example, you may want to view statistics about how many rows satisfied a business rule and how many did not.

Verify report validity. You can use data profiling to verify the validity of Business Intelligence (BI) reports.
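The ZIP-code business rule described above could be sketched as a simple validation check; the pattern and function names are illustrative only and are not part of PowerCenter:

```python
import re

# Illustrative sketch: a check like the one a business-rule profile
# function might apply to verify U.S. ZIP codes (5 digits, optionally
# followed by a 4-digit extension).
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def validate_zip_column(values):
    """Return counts of rows that satisfy or violate the ZIP-code rule."""
    valid = sum(1 for v in values if ZIP_PATTERN.match(v or ""))
    return {"valid": valid, "invalid": len(values) - valid}

print(validate_zip_column(["94107", "02134-1000", "ABCDE", None]))
```

The valid/invalid counts correspond to the kind of statistics described under "Verify assumptions": how many rows satisfied a business rule and how many did not.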


Data Mart Pros and Cons

Independent Data Marts

Independent data marts are usually the easiest and fastest to implement and their payback value can be almost immediate. Some corporations start with several data marts before deciding to build a true data warehouse. This approach has several inherent problems:

  • While data marts have obvious value, they are not a true enterprise-wide solution and can become very costly over time as more and more are added.
  • A major problem with proliferating data marts is that, depending on where you look for answers, there is often more than one version of the truth.
  • They do not provide the historical depth of a true data warehouse.
  • Because data marts are designed to handle specific types of queries from a specific type of user, they are often not good at “what if” queries like a data warehouse would be.

Logical Data Marts

Logical data marts overcome most of the limitations of independent data marts. They provide a single version of the truth. There is no historical limit to the data, and “what if” querying is entirely feasible. The major drawback to logical data marts is the lack of physical control over the data. Because data in the warehouse is not pre-aggregated or dimensionalized, performance against the logical mart will not usually be as good as against an independent mart. However, use of parallelism in the logical mart can overcome some of the limitations of the non-transformed data.

Dependent Data Marts

Dependent data marts provide all advantages of a logical mart and also allow for physical control of the data as it is extracted from the data warehouse. Because dependent marts use the warehouse as their foundation, they are generally considered a better solution than independent marts, but they take longer and are more expensive to implement.



Advantages of Using Summary Data

Until recently, most business decisions were based on summary data. The problem is that summarized data is not as useful as detail data and cannot answer some questions accurately. With summarized data, peaks and valleys are leveled: a peak that falls at the end of a reporting period is cut in half between two periods.

Here’s another example. Think of your monthly bank statement that records checking account activity. If it only told you the total amount of deposits and withdrawals, would you be able to tell if a certain check had cleared? To answer that question you need a list of every check received by your bank. You need detail data.
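The bank-statement example can be made concrete with a small sketch; the data and function names are invented for illustration:

```python
# Toy illustration of the bank-statement example: a summary total cannot
# tell you whether a specific check cleared, but the detail rows can.
checks = [
    {"check_no": 101, "amount": 50.00},
    {"check_no": 102, "amount": 120.00},
    {"check_no": 103, "amount": 35.50},
]

# Summary data: one number, with no way to see individual checks.
total_withdrawals = sum(c["amount"] for c in checks)

# Detail data: the question is answerable row by row.
def check_cleared(check_no):
    return any(c["check_no"] == check_no for c in checks)

print(total_withdrawals)    # 205.5
print(check_cleared(102))   # True
print(check_cleared(104))   # False
```

The summary answers "how much was withdrawn?", but only the detail rows answer "did check 102 clear?" — which is the point of keeping detail data.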

Decision support, answering business questions, is the real purpose of databases. To answer business questions, decision-makers must have four things:

  • The right data
  • Enough detail data
  • Proper data structure
  • Enough computer power to access and produce reports on the data


Internet and E-Business

The Teradata Database provides a single repository for customer information that helps E-Businesses build and maintain one-to-one customer relationships that are critical to their success on the Internet. The Teradata Database supports the fast-paced style of E-Business by allowing many concurrent users to ask complicated questions as they think of them — and get quick answers.

The Teradata Database allows E-Businesses to:

  • Capture massive amounts of click-stream data.
  • Enable multiple users to ask complex questions of the customers’ click-stream data with near real-time response.
  • Protect customers’ privacy with consumer opt-in/opt-out preferences and the ability for consumers to check and revise their information stored in the Teradata Database through the Internet or a company call center.