|
Data Modeling Made Simple contains over
30 exercises. My responses are shown in bold italics
under each of the exercises that appear below. If you have questions or
different responses on these exercises that you would like to share with
me, please fill in the form at the bottom of this page.
Exercise 2.1
Let’s apply the modeling scope cube to our
example of the business card. Which factors of time, realm, and function
would each of the following application efforts fit?
-
A contact management
application for the marketing department to be implemented in six
months.
|
Time: |
6 months |
|
Realm: |
Process (NOTE: We might be inclined to say that the realm is the marketing
department. However, we are not modeling the entire marketing
department, but rather one of the processes within marketing: contact
management.)
|
|
Function: |
Application
|
-
An enterprise model of our
company current view verses five years from now.
|
Time: |
Today and 5 years |
|
Realm: |
Organization
|
|
Function: |
Business (NOTE: We would need to clarify whether an application view would
be important, but in general the term 'enterprise model' applies more to
business, not application. If the term 'application landscape' or
'application architecture' was used, this would be a good indication that
the function is application.) |
- An
application for formatting business cards that currently exists.
|
Time: |
Today |
|
Realm: |
Process (formatting business cards)
|
|
Function: |
Application |
Exercise 3.2
Why is it important to know whether an entity is
independent or dependent?
There are three reasons why it is
important to know whether an entity is independent or dependent. A
business, functional and technical reason. The business reason is that
the analyst is forced to ask the question "What makes this entity
unique?" That is, the analyst needs to clarify whether something else is
needed for uniqueness. The functional reason is that it leads us to
properly identify the primary keys. The technical reason is that it
tells the developer the sequence to load the data. That is, you need to
load the independent entity instances before the dependent.
Let's look at the entity student registration,
for example. The business analyst, knowing there is a difference between
independent and dependent entities, would ask the business user 'What makes
student registration unique?' The answer might include the student number
implying that student registration is dependent on student. Functionally, this
means that the primary key of student registration comes from student.
Technically, this means we must load the student table before we load the
student registration table. That is, we need to ensure that a student
registration record never arrives in the system without a valid student number.
Exercise
3.4
For which types of applications is subtyping
most useful? For which types is it least useful?
Subtyping is most useful when there is a need
for integration and application longevity. Adding the party entity for example,
makes it easier to expand the model for many types of parties not yet present in
our design such as vendors, consumers, and competitors. Types of applications
where integration is essential are data warehouses, operational data stores,
integration hubs, and the like, which are characterized by having multiple
sources and multiple targets, and the need for integration and application
longevity.
Subtyping is least useful where the
application is driven primarily by user friendliness and performance. So
for example, reporting applications such as decision support and
business intelligence types of applications are not good candidates for
subtyping.
Exercise 4.1
Part A:
What do you
think the domains for the following three data elements should be?
-
e-mail
address - 100 characters
maximum length with one '@' symbol and at least one period after the '@'
symbol. So for example me@stevehoberman.com is acceptable but
me@stevehobermancom and mestevehoberman.com are not acceptable. Note we
can make this more specific and have a lookup against the valid web
address domains such as .com, .edu, etc.
-
Web address
- 100 characters
maximum length with at least one period somewhere in the address. Note we
can make this more specific and have a lookup against the valid web
address domains such as .com, .edu, etc.
-
telephone number
- 30 characters.
Note, this would be a good length if the entire phone number is stored in
a single data element. However, we also might consider storing country code and
area code in separate data elements and therefore the actual phone number field
might be closer to 20 characters.
General note about this exercise: If you ever have difficulty identifying the
domain for a data element, see what an ERP (Enterprise Resource Planning)
application uses for the same data element and copy that. For example, I found a
30 character phone number in SAP/R3 and therefore this is a pretty safe format
to go with.
Part B:
Now that you
have domains defined for each of the foregoing three data elements,
assume you are project manager for an application that will be
populating these three data elements. In your application, how would you
like to see domain violations handled? In other words, if an e-mail
address arrives that does not belong to the domain you defined for
e-mail addresses, what kind of action would you like to see taken? Why?
It is very
dependent on the type of application that would dictate how domain violations on
these three data elements are handled. If I am designing a marketing or contact
database where addresses and phone number are relied on heavily, I might suggest
alerting someone immediately when a violation occurs. If however, these three
data elements are descriptive information about someone that may or may not ever
be actually accessed, I would most likely copy the primary keys identifying
these problem records to a separate file that someone might examine on a
periodic basis - for example, once a week.
Exercise 4.2
In the following entity, which data elements would make a good alternate
key?
In order to have any alternate key on this table, we need to make
the following three assumptions:
- If the taxpayer is a person, Social Security number must be filled
in.
- If the taxpayer is a company, federal tax id number must be filled
in.
- Legally, we can store and use Social Security number to identify
individuals.
With these three assumptions in place, we can remove the Social Security number
and federal tax id number data elements, and replace them with a taxpayer number
and individual company indicator which become the alternate key as shown in the
model below. If the taxpayer is an individual, the individual company indicator
will be set to 'I' and the taxpayer number will contain the Social Security
number. If the taxpayer is a company, the individual company indicator will be
set to 'C' and the taxpayer number will contain the federal tax id number.

Exercise 5.1
Which of the following rules can be represented
on a data model?
a)
An order is identified by an order number and must have a
valid ID from a customer and salesperson.
This can be easily shown on a data
model, such as:

b)
A high-risk property is one that has had three or more
claims filed in the last four years with a total value of more than
$50,000 in losses.
No we can't show this action rule. We
can use subtyping to highlight a high-risk property as in the following diagram,
but we cannot show this action rule on the model. Note, you can show almost any
action rule with database views, but that is outside the scope of this exercise.

c)
Monthly sales reports can be produced only for managers and
senior managers.
Yes, something like the following
would work quite nicely:

Note that in this example Manager is
an entity that contains the type of managers, not the actual manager's name.
d)
A policy can be either active or inactive.
Yes, such as the following:
Note that on a physical data model,
this situation will mostly likely be shown using a Policy Type Code.
e)
A bill of materials must contain at least five ingredients.
No we can't show a specific number
unless we do it through a label or description on the relationship, such
as the following:
f)
If you receive three speeding tickets, then you lose your
driver’s license.
No, we can't show this on the model
(unless through a database view), as this is an action rule.
Exercise 5.2
Practice reading the relationships in this model.
-
Each Associate can support zero, one
or many Customers.
-
Each Customer must be supported by one
and only one Associate.
-
Each Associate can receive commission
on zero, one or many Shipments.
-
Each Shipment must produce commission
for one and only one Associate.
-
Each Customer can place zero, one or
many Orders.
-
Each Order must be placed by one and
only one Customer.
-
Each Customer can receive zero, one or
many Shipments.
-
Each Shipment must be received by one
and only one Customer.
-
Each Order can result in zero, one or
many Shipments.
-
Each Shipment must result from one and
only one Order.
-
Each Shipment can contain zero, one or
many Shipment Lines.
-
Each Shipment Line must appear on one
and only one Shipment.
-
Each Item can appear on zero, one or
many Shipment Lines.
-
Each Shipment Line must be for one and
only one Item.
Exercise 5.3
Assume there is a rule in your company that an employee can work for one
and only one manager. A modeler you work with is working on a human
resource logical data model and has modeled the management hierarchy as
shown in the following entity. She would like to know your thoughts on
this. What are the pros and cons with using this entity to represent
this business rule?

This one has been added as a recent Design Challenge.
Click here for the
response.
Exercise 5.4
To illustrate the importance of specifying labels on relationships,
think of at least five appropriate labels that could apply to this
relationship.
-
An Employee can visit many Customers.
-
An Employee can support many Customers.
-
An Employee can bill many Customers.
-
An Employee can solicit many Customers.
-
An Employee can be the point of contact for many Customers.
Exercise 6.1
A modeler was having a difficult time getting
three departments to agree on a definition of
student. She eventually looked up
student
in the dictionary and suggested that they use this definition:
Any individual for whom
an educational institution maintains educational records.
What are the pros and cons of sticking with this dictionary definition?
The largest benefit this definition offers is
that it can meet everyone's needs. Who would argue with such a generic yet
complete definition such as this? However, the largest detriment of this
definition is that it can meet everyone's needs. The pro and con are the same in
that we are attempting to gain agreement on what a term means by using a very
generic definition. We could be masking differences by viewing the meaning at
too high a level. I would stick with this definition if student is a supertype
that will be subtyped into the different types of students such as alumni and
applicant, and then define these subtypes with sufficient detail to make them
meaningful.
Exercise
7.1
Are there any situations in which a generic
model such as in the one in fig. 7.1 would be useful?

I can see this type
of model being useful in two situations: where extreme flexibility is needed,
and as a starter model. If you are modeling an application such as a meta data
repository or ERP package (such as an SAP-like application), having this kind of
flexibility can accommodate any industry and any communication situation. You
can even capture that my pet dog Francis has a phone number with this model!
Also, this model can be useful if you want to get the high level concepts down
first before understand the types of parties and communications. More of a top
down modeling approach. First understand what a party is, and then subtype party
into customer, vendor, etc. First understand what communication is and then
subtype it down into phone number, email, etc.
Exercise 7.3
Change the model in fig. 7.2 to accommodate the
following business rule:
A
card
can contain information just for a
person
or just for a company. That is,
it is not always imperative that we have a
person
and a company. For example, we
could have Bob’s Plumbing Company appear on a business card without
having a person’s name appear along with it.
The original model is:

One solution to accommodate this business rule
is to introduce the supertype card holder:

Exercise 7.4
What disadvantages (if any) are there with
building an ASAM without first building a BSAM?
It is always advantageous to view
something within a larger context to make sure you're not missing
anything. Viewing an ASAM within the context of a BSAM ensures that all
concepts that should be within scope are represented on the ASAM. For
example, building a BSAM of an organization's order processing process
followed by an ASAM of a sales reporting application ensures that all
relevant order processing concepts are accommodated on the BSAM.
Building the ASAM without a BSAM could therefore lead to an incomplete
model.
Exercise 8.1
Do we really need “Address Sequence Number” in
customer address or could we use “Address Identifier” to sequence the address
lines instead?
We really do need the "Address Sequence
Number". If we use "Address Identifier" as the sequence number, then we
are giving meaning to a surrogate key. We are indicating that an
"Address Identifier" value of '1' means the first line, '2' means the
second line, etc. A surrogate key is a counter where the first record is
one, second record is two, and so on. There is no guarantee that the
second record would be or should be the second line in the address. We
should try hard not to use one data element for two different purposes.
Exercise 8.3
Assume that there is a need to represent a
global address structure. Create a logical data model of a complete
global-address structure. Note: you may need to go through 2NF and 3NF
before completing your model.
This will be a future Design Challenge.
Stay tuned!
Exercise 8.4
Assume that some slogans are more important than
others. For example, some of the business cards have a main slogan in
the center followed by smaller and less important slogans around the
corners of the cards. For example, assume that greater importance is
placed on “magic for all occasions” than the other six slogans on
business card 4. Adjust the model in fig. 8.7 for this change.
I would add a slogan priority sequence
number to the slogan entity, as shown here. The most important slogan on
a business card would contain the value '1', followed by next most
important with '2', and so on.

Exercise 8.5
Is there anything available on a business card
that could be used as an alternate key for
person? If not, what would make an ideal alternate key for a person?
This will be a future Design Challenge.
Stay tuned!
Exercise 8.7
Abstract all other areas you think would add
value in fig. 8.14.
The original model is:

Although abstracting gives us more flexibility, it removes readability
and rule checking from the model. So in being very cautious where to
abstract, I might limit myself to the Email, Web, and Street Address
Text data elements in Employment. I can see having a second email
address or street address and therefore this would be a good candidate
for abstraction. I might also consider abstracting the person name data
elements if the business believed there could be more than just these
two names for a person, such as a nickname. Here is the model updated
after abstracting the address text data elements:

Exercise 8.9
Abstract the following data model:
Here is one way of abstracting this model:

Exercise 9.1
What issues can occur by allowing a Slogan
Sequence Number of 1 to indicate “Not Applicable”?
The pro of expanding the domain of a
data element is that you can accommodate situations that the normal
domain would not allow. For example, expanding the time domain to values
such as "Not Applicable" would allow the model to handle situations
where you can enforce a time foreign key even though no time value is
relevant. This is also the biggest issue though. It allows us to violate
business rules through the use of creative data. If we want to get
around a business rule, we simple assign the value "Not Applicable". So
the integrity of our model is compromised, and poor data quality can be
the result if we are not careful.
Exercise 9.2
Fig. 9.4 shows sales over the last 12 months.
Assume there is a reporting need to see sales over the last 24 months,
not for the last 12 months. Would you prefer to continue to use the
model in fig. 9.4, or repeat a set of data elements for 24 months as
shown in the following entity? Explain your answer.
This will be a future Design Challenge. Stay tuned!
Exercise
9.3
Which subject area do we tend to most frequently
apply repeating data elements from?
Subject areas that represent transactions
tend to have the most repeating groups. Examples include order and
claim, and even the example in Exercise 9.2 shows order summary data in
a repeating group structure.
Exercise 9.4
Describe at least two other frequent modeling
situations where having an additional data element to distinguish entity
instances adds value. In other words, what other data element besides a
level code would be useful for distinguishing different types of entity
instances
One very frequent modeling situation is
one way of resolving subtypes. If we have non-overlapping subtypes such
as in the following example:

We can resolve the subtypes (yet still
distinguish them) through a type code, such as gender code:

Exercise 9.5
Name at least five factors we should consider
before creating a summary table.
-
Number of rows that on average are
combined. This factor reveals how useful the summary table will
be in terms of how many rows are actually summarized to gauge space
savings and performance improvements. For example, going from a day
level transaction table to a year level transaction summary might on
average go from 365 rows to 1 row. However, from day to week might
only go from 7 to 1.
-
Amount of time to build the summary
table. If it takes a significant amount of time to actually
build or refresh the table, you might consider whether it is worth
the effort. A daily level summary table that takes 16 hours to build
may not be practical or efficient.
-
Usage. How often will the
summary structure be accessed? We need to gauge how many users and
reports will be accessing the summary structure and how often they
will do so. That is, how valuable is the summarized structure?
-
Whether you need to reopen after
close of cycle.
A summary table might initially be easy to populate, but a
challenge for many developers is what to do when a transaction comes
in after the summary table is already built that needs to be
considered in the summary view. This happens more often than we
would like in situations where a transaction needs to be posted
against a prior period. Even though the "books have closed" for
example, there is a still a requirement to go back and add a $3
transaction somewhere within a $300 million summary table.
-
Data quality. Perhaps there is
no performance or usage requirement for a summary table, but perhaps
you are faced with very poor quality at a detailed level. One
solution is to summarize very granular and suspect structures into
slightly summarized yet higher quality structures. For example, if a
transaction level of granularity has data quality issues, an end of
day summary view might produce higher quality results.
-
Privacy. If a person's privacy
rights can be violated or sensitive data can be easily viewed,
summarizing can offer a level of protection. Some insurance
companies summarize claimant level data when reporting so that a
user cannot tell what person has what ailment for example.
Exercise 9.6
What is a good format and length for a surrogate
key?
As long as it is a counter that is large
enough to allow for uniqueness across an entire subject area for a very
long period of time (at least past your own retirement!), it should be
acceptable. Make sure for consistency purposes to use the same format
across all surrogate keys. For example, in our data warehouse, we use
Number(12) as our format consistently for all surrogate keys.
Exercise 9.7
Identify at least one situation in which it does
not make sense to use a surrogate key.
A state code entity, where State Code
could take on the value 'NY' and State Description Text could take on
the value 'New York'. This is a good area not to use a surrogate key
because the state code will probably never change so you don't need to
worry about stability and it is small enough to be an efficient key.
Exercise 9.8
Which of the following queries require a
dimensional model?
-
Show me the names of all alumni who have donated more than
$50,000 over the last five years.
-
Show me our donations by class, region, and year..
-
Show me our top-selling brands by region and year..
-
Show me our gross revenue by brand by region and year.
Queries
that return results that can be mathematically manipulated are
candidates for a dimensional model. So in the list above, the queries
that require a dimensional model are:
-
Show me our donations by class, region, and year.
-
Show me our gross revenue by brand by region and year
Exercise 9.9
Build an initial dimensional model for a
consumer affairs department that has the following request:
Show me the number of complaints, compliments, and questions by product,
month, and region.
Next, convert your model into a star schema,
snowflake, or starflake, depending on what you think is most
appropriate.
This will be a future Design Challenge.
Stay tuned!
Exercise 9.10
Do you think it is easier to create a
dimensional model or a relational model? Explain your answer.
Dimensional is usually easier because you
are representing what is needed and you don't need to concern yourself
(at least initially) with how the business works. It is usually harder
to represent how the business works which is the basis for a relational
model. However, if the requirements are unknown or very difficult to
capture, the dimensional model can be more difficult. On the whole
though, I have found it is easier to build dimensional models.
Exercise 10.1
Which
modeling approach would be ideal for gauging the impact of
customizing a brand new ERP package? Explain your answer.
This will be a future Design Challenge. Stay
tuned!
|