Data Modeling Made Simple Exercise Answers

Data Modeling Made Simple contains over 30 exercises. My responses are shown in bold italics under each of the exercises that appear below. If you have questions or different responses on these exercises that you would like to share with me, please fill in the form at the bottom of this page. 

Exercise 2.1

Let’s apply the modeling scope cube to our example of the business card. Which factors of time, realm, and function would each of the following application efforts fit?

 

-          A contact management application for the marketing department to be implemented in six months.

Time:  6 months
Realm: 

Process (NOTE: We might be inclined to say that the realm is the marketing department. However, we are not modeling the entire marketing department, but rather one of the processes within marketing: contact management.)

Function:  Application

-          An enterprise model of our company current view verses five years from now.

Time:  Today and 5 years
Realm: 

Organization

Function:  Business (NOTE: We would need to clarify whether an application view would be important, but in general the term 'enterprise model' applies more to business, not application. If the term 'application landscape' or 'application architecture' was used, this would be a good indication that the function is application.)

- An application for formatting business cards that currently exists.

Time:  Today
Realm: 

Process (formatting business cards)

Function:  Application

Exercise 3.2

Why is it important to know whether an entity is independent or dependent?

There are three reasons why it is important to know whether an entity is independent or dependent. A business, functional and technical reason. The business reason is that the analyst is forced to ask the question "What makes this entity unique?" That is, the analyst needs to clarify whether something else is needed for uniqueness. The functional reason is that it leads us to properly identify the primary keys. The technical reason is that it tells the developer the sequence to load the data. That is, you need to load the independent entity instances before the dependent.

Let's look at the entity student registration, for example. The business analyst, knowing there is a difference between independent and dependent entities, would ask the business user 'What makes student registration unique?' The answer might include the student number implying that student registration is dependent on student. Functionally, this means that the primary key of student registration comes from student. Technically, this means we must load the student table before we load the student registration table. That is, we need to ensure that a student registration record never arrives in the system without a valid student number.

Exercise 3.4

For which types of applications is subtyping most useful? For which types is it least useful?

Subtyping is most useful when there is a need for integration and application longevity. Adding the party entity for example, makes it easier to expand the model for many types of parties not yet present in our design such as vendors, consumers, and competitors. Types of applications where integration is essential are data warehouses, operational data stores, integration hubs, and the like, which are characterized by having multiple sources and multiple targets, and the need for integration and application longevity.

Subtyping is least useful where the application is driven primarily by user friendliness and performance. So for example, reporting applications such as decision support and business intelligence types of applications are not good candidates for subtyping. 

Exercise 4.1

Part A:

What do you think the domains for the following three data elements should be?

-          e-mail address - 100 characters maximum length with one '@' symbol and at least one period after the '@' symbol. So for example me@stevehoberman.com is acceptable but me@stevehobermancom and mestevehoberman.com are not acceptable. Note we can make this more specific and have a lookup against the valid web address domains such as .com, .edu, etc.

-          Web address - 100 characters maximum length with at least one period somewhere in the address. Note we can make this more specific and have a lookup against the valid web address domains such as .com, .edu, etc.

-          telephone number - 30 characters. Note, this would be a good length if the entire phone number is stored in a single data element. However, we also might consider storing country code and area code in separate data elements and therefore the actual phone number field might be closer to 20 characters.

General note about this exercise: If you ever have difficulty identifying the domain for a data element, see what an ERP (Enterprise Resource Planning) application uses for the same data element and copy that. For example, I found a 30 character phone number in SAP/R3 and therefore this is a pretty safe format to go with.

 Part B:

Now that you have domains defined for each of the foregoing three data elements, assume you are project manager for an application that will be populating these three data elements. In your application, how would you like to see domain violations handled? In other words, if an e-mail address arrives that does not belong to the domain you defined for e-mail addresses, what kind of action would you like to see taken? Why?

It is very dependent on the type of application that would dictate how domain violations on these three data elements are handled. If I am designing a marketing or contact database where addresses and phone number are relied on heavily, I might suggest alerting someone immediately when a violation occurs. If however, these three data elements are descriptive information about someone that may or may not ever be actually accessed, I would most likely copy the primary keys identifying these problem records to a separate file that someone might examine on a periodic basis - for example, once a week.

 

Exercise 4.2

In the following entity, which data elements would make a good alternate key?

 

In order to have any alternate key on this table, we need to make the following three assumptions:

  1. If the taxpayer is a person, Social Security number must be filled in.
  2. If the taxpayer is a company, federal tax id number must be filled in.
  3. Legally, we can store and use Social Security number to identify individuals.

With these three assumptions in place, we can remove the Social Security number and federal tax id number data elements, and replace them with a taxpayer number and individual company indicator which become the alternate key as shown in the model below. If the taxpayer is an individual, the individual company indicator will be set to 'I' and the taxpayer number will contain the Social Security number. If the taxpayer is a company, the individual company indicator will be set to 'C' and the taxpayer number will contain the federal tax id number. 

Exercise 5.1

Which of the following rules can be represented on a data model?

 

a)      An order is identified by an order number and must have a valid ID from a customer and salesperson.

This can be easily shown on a data model, such as:

b)      A high-risk property is one that has had three or more claims filed in the last four years with a total value of more than $50,000 in losses.

No we can't show this action rule. We can use subtyping to highlight a high-risk property as in the following diagram, but we cannot show this action rule on the model. Note, you can show almost any action rule with database views, but that is outside the scope of this exercise.

c)      Monthly sales reports can be produced only for managers and senior managers.

Yes, something like the following would work quite nicely:

Note that in this example Manager is an entity that contains the type of managers, not the actual manager's name. 

d)      A policy can be either active or inactive.

Yes, such as the following:

Note that on a physical data model, this situation will mostly likely be shown using a Policy Type Code.

e)      A bill of materials must contain at least five ingredients.

No we can't show a specific number unless we do it through a label or description on the relationship, such  as the following: 

f)       If you receive three speeding tickets, then you lose your driver’s license.

No, we can't show this on the model (unless through a database view), as this is an action rule. 

Exercise 5.2

Practice reading the relationships in this model.

  • Each Associate can support zero, one or many Customers.

  • Each Customer must be supported by one and only one Associate. 

  • Each Associate can receive commission on zero, one or many Shipments.

  • Each Shipment must produce commission for one and only one Associate.

  • Each Customer can place zero, one or many Orders.

  • Each Order must be placed by one and only one Customer.

  • Each Customer can receive zero, one or many Shipments.

  • Each Shipment must be received by one and only one Customer.

  • Each Order can result in zero, one or many Shipments.

  • Each Shipment must result from one and only one Order.

  • Each Shipment can contain zero, one or many Shipment Lines.

  • Each Shipment Line must appear on one and only one Shipment.

  • Each Item can appear on zero, one or many Shipment Lines.

  • Each Shipment Line must be for one and only one Item.

 

Exercise 5.3

Assume there is a rule in your company that an employee can work for one and only one manager. A modeler you work with is working on a human resource logical data model and has modeled the management hierarchy as shown in the following entity. She would like to know your thoughts on this. What are the pros and cons with using this entity to represent this business rule?

This one has been added as a recent Design Challenge. Click here for the response.

Exercise 5.4

To illustrate the importance of specifying labels on relationships, think of at least five appropriate labels that could apply to this relationship.

  • An Employee can visit many Customers.
  • An Employee can support many Customers.
  • An Employee can bill many Customers.
  • An Employee can solicit many Customers.
  • An Employee can be the point of contact for many Customers.

Exercise 6.1

A modeler was having a difficult time getting three departments to agree on a definition of student. She eventually looked up student in the dictionary and suggested that they use this definition:

 Any individual for whom an educational institution maintains educational records.

 What are the pros and cons of sticking with this dictionary definition?

The largest benefit this definition offers is that it can meet everyone's needs. Who would argue with such a generic yet complete definition such as this? However, the largest detriment of this definition is that it can meet everyone's needs. The pro and con are the same in that we are attempting to gain agreement on what a term means by using a very generic definition. We could be masking differences by viewing the meaning at too high a level. I would stick with this definition if student is a supertype that will be subtyped into the different types of students such as alumni and applicant, and then define these subtypes with sufficient detail to make them meaningful.

Exercise 7.1

Are there any situations in which a generic model such as in the one in fig. 7.1 would be useful?

I can see this type of model being useful in two situations: where extreme flexibility is needed, and as a starter model. If you are modeling an application such as a meta data repository or ERP package (such as an SAP-like application), having this kind of flexibility can accommodate any industry and any communication situation. You can even capture that my pet dog Francis has a phone number with this model!

Also, this model can be useful if you want to get the high level concepts down first before understand the types of parties and communications. More of a top down modeling approach. First understand what a party is, and then subtype party into customer, vendor, etc. First understand what communication is and then subtype it down into phone number, email, etc.

    

Exercise 7.3

Change the model in fig. 7.2 to accommodate the following business rule:

A card can contain information just for a person or just for a company. That is, it is not always imperative that we have a person and a company. For example, we could have Bob’s Plumbing Company appear on a business card without having a person’s name appear along with it.

The original model is:

One solution to accommodate this business rule is to introduce the supertype card holder:

Exercise 7.4

What disadvantages (if any) are there with building an ASAM without first building a BSAM?

It is always advantageous to view something within a larger context to make sure you're not missing anything. Viewing an ASAM within the context of a BSAM ensures that all concepts that should be within scope are represented on the ASAM. For example, building a BSAM of an organization's order processing process followed by an ASAM of a sales reporting application ensures that all relevant order processing concepts are accommodated on the BSAM. Building the ASAM without a BSAM could therefore lead to an incomplete model.

Exercise 8.1

Do we really need “Address Sequence Number” in customer address or could we use “Address Identifier” to sequence the address lines instead?

We really do need the "Address Sequence Number". If we use "Address Identifier" as the sequence number, then we are giving meaning to a surrogate key. We are indicating that an "Address Identifier" value of '1' means the first line, '2' means the second line, etc. A surrogate key is a counter where the first record is one, second record is two, and so on. There is no guarantee that the second record would be or should be the second line in the address. We should try hard not to use one data element for two different purposes.

Exercise 8.3

Assume that there is a need to represent a global address structure. Create a logical data model of a complete global-address structure. Note: you may need to go through 2NF and 3NF before completing your model.

This will be a future Design Challenge. Stay tuned!

Exercise 8.4

Assume that some slogans are more important than others. For example, some of the business cards have a main slogan in the center followed by smaller and less important slogans around the corners of the cards. For example, assume that greater importance is placed on “magic for all occasions” than the other six slogans on business card 4. Adjust the model in fig. 8.7 for this change.

I would add a slogan priority sequence number to the slogan entity, as shown here. The most important slogan on a business card would contain the value '1', followed by next most important with '2', and so on.

Exercise 8.5

Is there anything available on a business card that could be used as an alternate key for person? If not, what would make an ideal alternate key for a person?

This will be a future Design Challenge. Stay tuned!

Exercise 8.7

Abstract all other areas you think would add value in fig. 8.14.

The original model is:

Although abstracting gives us more flexibility, it removes readability and rule checking from the model. So in being very cautious where to abstract, I might limit myself to the Email, Web, and Street Address Text data elements in Employment. I can see having a second email address or street address and therefore this would be a good candidate for abstraction. I might also consider abstracting the person name data elements if the business believed there could be more than just these two names for a person, such as a nickname. Here is the model updated after abstracting the address text data elements:

Exercise 8.9

Abstract the following data model:

Here is one way of abstracting this model:

Exercise 9.1

What issues can occur by allowing a Slogan Sequence Number of 1 to indicate “Not Applicable”?

The pro of expanding the domain of a data element is that you can accommodate situations that the normal domain would not allow. For example, expanding the time domain to values such as "Not Applicable" would allow the model to handle situations where you can enforce a time foreign key even though no time value is relevant. This is also the biggest issue though. It allows us to violate business rules through the use of creative data. If we want to get around a business rule, we simple assign the value "Not Applicable". So the integrity of our model is compromised, and poor data quality can be the result if we are not careful.

Exercise 9.2

Fig. 9.4 shows sales over the last 12 months. Assume there is a reporting need to see sales over the last 24 months, not for the last 12 months. Would you prefer to continue to use the model in fig. 9.4, or repeat a set of data elements for 24 months as shown in the following entity? Explain your answer.

 

This will be a future Design Challenge. Stay tuned!

Exercise 9.3

Which subject area do we tend to most frequently apply repeating data elements from?

Subject areas that represent transactions tend to have the most repeating groups. Examples include order and claim, and even the example in Exercise 9.2 shows order summary data in a repeating group structure. 

Exercise 9.4

Describe at least two other frequent modeling situations where having an additional data element to distinguish entity instances adds value. In other words, what other data element besides a level code would be useful for distinguishing different types of entity instances

One very frequent modeling situation is one way of resolving subtypes. If we have non-overlapping subtypes such as in the following example:

We can resolve the subtypes (yet still distinguish them) through a type code, such as gender code:

Exercise 9.5

Name at least five factors we should consider before creating a summary table.

  • Number of rows that on average are combined. This factor reveals how useful the summary table will be in terms of how many rows are actually summarized to gauge space savings and performance improvements. For example, going from a day level transaction table to a year level transaction summary might on average go from 365 rows to 1 row. However, from day to week might only go from 7 to 1.

  • Amount of time to build the summary table. If it takes a significant amount of time to actually build or refresh the table, you might consider whether it is worth the effort. A daily level summary table that takes 16 hours to build may not be practical or efficient.

  • Usage. How often will the summary structure be accessed? We need to gauge how many users and reports will be accessing the summary structure and how often they will do so. That is, how valuable is the summarized structure?

  • Whether you need to reopen after close of cycle. A summary table might initially be easy to populate, but a challenge for many developers is what to do when a transaction comes in after the summary table is already built that needs to be considered in the summary view. This happens more often than we would like in situations where a transaction needs to be posted against a prior period. Even though the "books have closed" for example, there is a still a requirement to go back and add a $3 transaction somewhere within a $300 million summary table.

  • Data quality. Perhaps there is no performance or usage requirement for a summary table, but perhaps you are faced with very poor quality at a detailed level. One solution is to summarize very granular and suspect structures into slightly summarized yet higher quality structures. For example, if a transaction level of granularity has data quality issues, an end of day summary view might produce higher quality results.

  • Privacy. If a person's privacy rights can be violated or sensitive data can be easily viewed, summarizing can offer a level of protection. Some insurance companies summarize claimant level data when reporting so that a user cannot tell what person has what ailment for example.

 

Exercise 9.6

What is a good format and length for a surrogate key?

As long as it is a counter that is large enough to allow for uniqueness across an entire subject area for a very long period of time (at least past your own retirement!), it should be acceptable. Make sure for consistency purposes to use the same format across all surrogate keys. For example, in our data warehouse, we use Number(12) as our format consistently for all surrogate keys.

Exercise 9.7

Identify at least one situation in which it does not make sense to use a surrogate key.

A state code entity, where State Code could take on the value 'NY' and State Description Text could take on the value 'New York'. This is a good area not to use a surrogate key because the state code will probably never change so you don't need to worry about stability and it is small enough to be an efficient key.

Exercise 9.8

Which of the following queries require a dimensional model?

 

-          Show me the names of all alumni who have donated more than $50,000 over the last five years.

-          Show me our donations by class, region, and year..

-          Show me our top-selling brands by region and year..

-          Show me our gross revenue by brand by region and year.

Queries that return results that can be mathematically manipulated are candidates for a dimensional model. So in the list above, the queries that require a dimensional model are:

  • Show me our donations by class, region, and year.

  • Show me our gross revenue by brand by region and year

 

Exercise 9.9

Build an initial dimensional model for a consumer affairs department that has the following request:

 

Show me the number of complaints, compliments, and questions by product, month, and region.

 

Next, convert your model into a star schema, snowflake, or starflake, depending on what you think is most appropriate.

This will be a future Design Challenge. Stay tuned!

Exercise 9.10

Do you think it is easier to create a dimensional model or a relational model? Explain your answer.

Dimensional is usually easier because you are representing what is needed and you don't need to concern yourself (at least initially) with how the business works. It is usually harder to represent how the business works which is the basis for a relational model. However, if the requirements are unknown or very difficult to capture, the dimensional model can be more difficult. On the whole though, I have found it is easier to build dimensional models.

Exercise 10.1

Which modeling approach would be ideal for gauging the impact of customizing a brand new ERP package? Explain your answer.

This will be a future Design Challenge. Stay tuned!