Is having large records bad?

josh · November 2019

I have been told that having large records is bad (i.e., records with lots of data in them). Is this true?

What I am trying to do:

We have a table called CUSTOMER_NOTES. Each customer can have 1 or many notes, and they're all stored in this table. A note contains a fairly long string (let's say 2000 characters), among other things.

A note can be replied to 1 or many times. At the moment, we store the replies to a note in the original note's record (multi-value). But I've been told that this is bad, because the record's size will grow too large. Is this true? The proposed solution to get around this is to create a new record for each reply, and link a reply to the note it's replying to (though a pointer field)

For example:

Note 1
blah blah this is note 1

Note 2
blah blah blah this is note 2

Solution 1: Store both note 1 and note 2 in the same OI record as separate values (multi-value)
Solution 2: Store note 1 and note 2 in separate records, but store a reference to 1 in 2, and a reference to 2 in 1.

For me, solution number 1 is the best, as it makes use of OI's multi-value feature, and it reduces the number of reads my application has to do (reads in OI are extremely slow (to the point of being broken))*

* I am not sure why this is. I am guessing it's because the records are stored by OI far away from each other on the hard disk, so when you read records (keys), you need to fetch each one from a different "place" (page) on the disk, rather than one read getting a whole bunch.

DonBakke · November 2019

I have been told that having large records is bad (i.e., records with lots of data in them). Is this true?

Questions like these are usually hard to generalize. "Large" can mean different things to different people. My first consideration what is the best design architecture and then I consider constraints (such as network lag, slow disks, etc.) and then try to optimize my design around those.

Large records can be bad. I would also argue that a record that gets sufficiently large probably needs a design overhaul. Regardless of the way you design the system you will want to configure the database table so it will handle your data as efficiently as possible. This is where you provide the Table Builder the Average Size, Frame Size, and Threshold information. You could also consider presizing the table with a sizelock that prevents shrinking.

A note can be replied to 1 or many times. At the moment, we store the replies to a note in the original note's record (multi-value). But I've been told that this is bad, because the record's size will grow too large. Is this true? The proposed solution to get around this is to create a new record for each reply, and link a reply to the note it's replying to (though a pointer field)

Just because you can use a MultiValue field doesn't mean you should. I would design this so each note gets its own record. This would be a design architecture decision. Records that maintain data like this have a tendency to suffer from lock contention. I don't know what kind of use cases can exist, but if is possible for multiple users to add notes to the same customer at the same time then you'll want each note to be a new record so no one gets locked out. I would then create a relational index so that each child note relates back to the parent. This will automate the pointer system.

(reads in OI are extremely slow (to the point of being broken))*

* I am not sure why this is. I am guessing it's because the records are stored by OI far away from each other on the hard disk, so when you read records (keys), you need to fetch each one from a different "place" (page) on the disk, rather than one read getting a whole bunch.

So your preference is based on platform related issues. This is one reason why it is hard to generalize. The engineer in me would really want to understand the network layout better and work toward a resolution. The benefits of this are far reaching.

josh · November 2019

Yes, my preference is based on performance. Otherwise, solution 2 would be perfect.

But i think if I limit the amount of data returned to the last x amount days of notes by default, it won't be too slow. But it's still worrying that the system is going to get slower and slower with time.

What I am trying to do:

Create an OI form that when opened, displays a subset of all of a customer's notes, which the user can browse through, respond to, add another to, etc (like an email client)
It's unlikely that a customer would have so many notes within the last x days that the form would take a noticeably long time to read them all.

We have 5 million customers, and on average, a customer only has 9 notes...which is nothing. And that's 23 years worth of data.

We also store the data in Sybase, so I'm probably just going to read the data from there. OI is just so slow. reading 1000 records takes more than 3 seconds (hence my preference for creating as few records as possible*). Do you know if OI 10 fixes this bug?

*reading a lot of records is slow, but reading one record with lots of data in it is fast. That's the infuriating thing about OI.

Also note:
We're not creating this from scratch. The table already exists, and currently has 5 million records in it. Creating an index on a table like this sounds scary, as in our system, indexes on large tables have a tendency of breaking. Having to rebuild an index on a table with 5 million + records in it would take at least 7 hours....scary

josh · November 2019

@DonBakke

I would then create a relational index so that each child note relates back to the parent

What is the difference between a relation index, and a btree index?

BarryStevens · November 2019

CUSTOMER_NOTES has mv field CUSTOMER_NOTES_REPLIES_KEY

CUSTOMER_NOTES_REPLIES has field CUSTOMER_NOTES_KEY

A relational index setup will maintain keys in CUSTOMER_NOTES_REPLIES_KEY for all records created in CUSTOMER_NOTES_REPLIES relating to the speicific customer note.

to browse all replies to a note just throw the CUSTOMER_NOTES_REPLIES_KEY field at the CUSTOMER_NOTES_REPLIES screen (CreateParam)

DonBakke · November 2019

Btree indexes (and its close cousin the Cross-Ref index) just creates a reverse lookup of data. You still need to do an index search to find a match. The the associated source Key IDs are then returned.

A relational index maintains a relationship between a source and target table using a "join" relationship that you define. The end result is that all relevant child Key IDs are automatically maintained in a target column in the source table. Thus, there is no searching needed. You just pull the list of Key IDs and you automatically have all of your children.

BarryStevens · November 2019

Yep, what he said

josh · November 2019

OK thanks for clarifying, that's not how I was going to do it, but now i will consider it.

My original idea was to store everything thing in one table, and link notes via pointer fields...

Isn't one table simpler than 2 though?

DonBakke · November 2019

Actually, you can use the same table as the source and target for a relational index. We do this specifically when we want to create hierarchical structures. I think your notes/discussion database is well suited to this design structure. However, in order for it to work, you need to utilize multi-part Key IDs and calculated columns. It's beyond the scope of this discussion thread to break this down for you. It is the subject of a future blog article we want to publish.

That said, I still prefer a design where the data resides in one table and the relational indexes (aka pointers as you put them) in a different table. This will help avoid lock contention.

BarryStevens · November 2019

@josh

here is an article related to Don's 'relational indexes in a different table'

This principal applies also to OI9

DonBakke · November 2019

@josh something of note in that article Barry linked you to is an indirect argument for NOT using MultiValue fields to track the comments. By doing this, it allows each record in the table to have very different sizes. This can lead to a "skew[ing of] the hashing on the primary table."

josh · November 2019

Ok thank you for your help. I will discuss these ideas with my team and see what they say.

josh · November 2019

But I will mention one more thing, because it might impact the design.

We want to create a form that works like gmail. When you open the form, it will display a preview of all of customer's notes within the last 120 days in a grid. But we only want to show the most recent reply, like gmail. When you click on a row, it will load in a panel on the right the full reply chain and other info.

DonBakke · November 2019

So are you asking if your UI design would have an impact on the database design?

BarryStevens · November 2019

@josh if you use a sequential number system for the reply records, then one of the relational index params is a sort sequence "DR" descending right justified - this will work for .

josh · November 2019

@barry

Let me explain more.

This company has been around for 23 years, and the notes system has been in place since then. So far, we have 5 million note records. The key to a note is:

customer_number*note_no

The note_no is a sequential key. We store the last note_no for a customer, so getting the last 120 days worth of notes is easy (read down from the last note no).

josh · November 2019

@DonBakke

yes.

Le t me explain more of the context here.

We are remaking my company's callcentre system in OI. At the moment, it's a javascript app. As part of the redesign, I want to "fix" customer notes. This is the problem with the current system:

A reply to a note is stored within the note itself, which is bad for the reasons you outlined, but also bad because we enforce a limit on the size of the record. This means that users are prevented from writing more replies once the size limit has been reached, which infuriates them.

My idea to fix this was to make each reply its own note, but linked to the original note (through a pointer field, or as you suggest a relational index).

Current system design:

The key to the CUSTOMER_NOTES table is customer_number*note_no. So the key is a multi-part key. Also, the note_no is sequential, and we store the last note_no for each customer.

josh · December 2019

I actually don't think OI is a slow as I said.

When I run OI on my workstation, reads and btree.index are really slow. But when I run OI on the same computer that the "oi server" is running on (not really sure what OI server means -- to be honest i don't understand the client-server architecture of oI, if someone could explain it that would be useful) it's fast.

DonBakke · December 2019

When I run OI on my workstation, reads and btree.index are really slow. But when I run OI on the same computer that the "oi server" is running on...

That's what I would have expected. The problem isn't with OI or the server, but with your network. This is why my first response in this thread was this:

The engineer in me would really want to understand the network layout better and work toward a resolution. The benefits of this are far reaching.

...(not really sure what OI server means -- to be honest i don't understand the client-server architecture of oI, if someone could explain it that would be useful) it's fast.

This means you are running OI on the same server where the OI binaries (and presumably database tables) are stored. This implies you aren't routing anything through the network. OI isn't really client-server, which is why it is a bit hard to understand if you are familiar with traditional client-server systems. When you run OI from a workstation that is connected to the server via a network, you are running everything over the wire. This is why network latency can have such a huge impact upon OI performance.

josh · December 2019

Hi, but why is OI particularly bad over a network? Is this an inherent flaw in OI, or a configuration setting?

DonBakke · December 2019

It seems like you aren't reading me carefully. I never said OI is particularly bad over a network. I think I made it clear that the problem is with the network, not OI. OI can (and does) run very well over a network but that assumes a lot of things that are too numerous to go over comprehensively in a discussion forum. Your network could have latency problems due to poor cabling, multiple hops, packet prioritization, firewall filtering, anti-virus targeting, etc. There could also be some very poor design in the OI application itself. You might want to recommend that you have someone with deep knowledge of OI and networks analyze your environment to help you identify potential bottlenecks.

josh · December 2019

Ok thanks, now that you have explained that it's not necessarily OI but possibly the network, I will try to figure out why this is happening. But my point was that other apps that use the network are fast, that's why I thought there was something wrong with OI in particular -- but as you say, OI is particularly dependent on the network, so it would be particularly affected by a "slow" network.

daveh · March 2020

For the purpose as described I just use keys where there is a prefix, like the CUSTOMER_ID, and a suffix that is a sequence number. They are separated by the * as PART1 and PART2 as multipart keys do. The notes records are as long as need by for a topic, or for each event. If you want to relate them to one another just include a field in the row that contains the keys of those related. I don't use an OpenInsight index, I just gather up the keys, starting at 1 and increment until I have the most recent. Why burden with a relational since the keys have no meaning except for separating the rows?

DonBakke · March 2020

I'm not sure how your suggestion of including "a field in the row that contains the keys of those related" is fundamentally different from what a relational index is.

daveh · March 2020

It isn't. I just don't have OI manage that list.

DonBakke · March 2020

I see. I think there are pros and cons to using relational indexes in this situation. I used to be a firm disbeliever in relational indexes (mostly due to the 64K constraint in AREV and OI 16bit) but then became a convert after seeing them used in very creative and helpful ways (something I intend to write a blog article about eventually).

One advantage of using relational indexes is that the system just maintains the list of keys for you. Relational indexes are updated immediately unless there is a lock on the source row. Then it just gets queued up like Btree and Cross-Reference indexes and updated as soon as possible.

daveh · March 2020

I just LOVE using Relational Indexes over the Btree flavors, where appropriate. As an "old-timer" I wrote in Rev before Arev introduced the Btree and Cross-Reference secondary index types. The drawback to them is if you dare request a "Rebuild all indexes of a table". On large files the process can take all day and shouldn't be done while the table is in use since it clears every Related To values before it does the rebuild. It would be nice to have a "Rebuild all Btree and Cross-Reference indexes of a table" in the Index menu.

Is having large records bad?

Comments