Data Management

CAPS uses the SDS directory structure for its archive. As shown in figure 8. SDS organizes the data by year, network, station and channel. One file is used for each day of the year. This tree structure eases archiving of data. One complete year may be moved to external storage, e.g. tape libraries.

../_images/sds.png

SDS archive structure

File Format

CAPS uses the RIFF file format for data storage. A RIFF file consists of chunks. Each chunk starts with a 8 byte chunk header followed by data. The first 4 bytes denote the chunk type, the next 4 bytes the length of the following data block. Currently the following chunk types are supported:

  • SID - stream ID header
  • HEAD - data information header
  • DATA - data block

Figure 9 shows the possible structure of an archive file consisting of the different chunk types.

../_images/file_one_day.png

Possible structure of an archive file

SID Chunk

A data file may start with a SID chunk which defines the stream id of the following data. In the absence of a SID chunk, the stream ID is retrieved from the file name.

content type bytes
id=”SID” char[4] 4
chunkSize int32 4
networkCode + ‘\0’ char* len(networkCode) + 1
stationCode + ‘\0’ char* len(stationCode) + 1
locationCode + ‘\0’ char* len(locationCode) + 1
channelCode + ‘\0’ char* len(channelCode) + 1

HEAD Chunk

The HEAD chunk contains information about subsequent DATA chunks. It has a fixed size of 15 bytes and is inserted under the following conditions:

  • in front of first data chunk (beginning of file)
  • packet type changed
  • unit of measurement changed
content type bytes
id=”HEAD” char[4] 4
chunkSize (=7) int32 4
version int16 2
packetType char 1
unitOfMeasurement char[4] 4

The packetType entry refers to one of the supported types described in section Packet Types.

DATA Chunk

The DATA chunk contains the actually payload, which may be further structured into header and data parts.

content type bytes
id=”DATA” char[4] 4
chunkSize int32 4
data char* chunkSize

Section Packet Types describes the currently supported packet types. Each packet type defines its own data structure. Nevertheless CAPS requires each type to supply a startTime and endTime information for each record in order to create seamless data streams. The endTime may be stored explicitly or may be derived from startTime, chunkSize, dataType and samplingFrequency.

In contrast to a data streams, CAPS also supports storing of individual measurements. These measurements are indicated by setting the sampling frequency to 1/0.

Optimization

After a plugin packet is received and before it is written to disk, CAPS tries to optimize the file data in order reduce the overall data size and to increase the access time. This includes:

  • merging data chunks for continuous data blocks
  • splitting data chunks on the date limit
  • trimming overlapped data

Merging of Data Chunks

CAPS tries to create large continues blocks of data by reducing the number of data chunks. The advantage of large chunks is that less disk space is occupied by data chunk headers. Also seeking to a particular time stamp is faster because less data chunk headers need to be read.

Data chunks can be merged if the following conditions apply:

  • merging is supported by packet type
  • previous data header is compatible according to packet specification, e.g. samplingFrequency and dataType matches
  • endTime of last record equals startTime of new record (no gap)

Figure 10 shows the arrival of a new plugin packet. In alternative A) the merge failed and a new data chunk is created. In alternative B) the merger succeeds. In the latter case the new data is appended to the existing data block and the original chunk header is updated to reflect the new chunk size.

../_images/file_merge.png

Merging of data chunks for seamless streams

Splitting of Data Chunks

Figure 11 shows the arrival of a plugin packet containing data of 2 different days. If possible, the data is split on the date limit. The first part is appended to the existing data file. For the second part a new day file is created, containing a new header and data chunk. This approach ensures that a sample is stored in the correct data file and thus increases the access time.

Splitting of data chunks is only supported for packet types providing the trim operation.

../_images/file_split.png

Splitting of data chunks on the date limit

Trimming of Overlaps

The received plugin packets may contain overlapping time spans. If supported by the packet type CAPS will trim the data to create seamless data streams.

Packet Types

CAPS currently supports the following packet types:

  • RAW - generic time series data
  • ANY - any possible content
  • MiniSeed - native MiniSeed

RAW

The RAW format is a lightweight format for uncompressed time series data with a minimal header. The chunk header is followed by a 16 byte data header:

content type bytes
dataType char 1
startTime TimeStamp [11]
     year int16 2
     yDay uint16 2
     hour uint8 1
     minute uint8 1
     second uint8 1
     usec int32 4
samplingFrequencyNumerator uint16 2
samplingFrequencyDenominator uint16 2

The number of samples is calculated by the remaining chunkSize divided by the size of the dataType. The following data types value are supported:

id type bytes
1 double 8
2 float 4
100 int64 8
101 int32 4
102 int16 2
103 int8 1

The RAW format supports the trim and merge operation.

ANY

The ANY format was developed to store any possible content in CAPS. The chunk header is followed by a 31 byte data header:

content type bytes
type char[4] 4
dataType (=103, unused) char 1
startTime TimeStamp [11]
     year int16 2
     yDay uint16 2
     hour uint8 1
     minute uint8 1
     second uint8 1
     usec int32 4
samplingFrequencyNumerator uint16 2
samplingFrequencyDenominator uint16 2
endTime TimeStamp 11

The ANY data header extends the RAW data header by a 4 character type field. This field is indented to give a hint on the stored data. E.g. an image from a Web cam could be announced by the string JPEG.

Since the ANY format removes the restriction to a particular data type, the endTime can no longer be derived from the startTime and samplingFrequency. Consequently the endTime is explicitly specified in the header.

Because the content of the ANY format is unspecified it neither supports the trim nor the merge operation.

MiniSeed

MiniSeed is the standard for the exchange of seismic time series. It uses a fixed record length and applies data compression.

CAPS adds no additional header to the MiniSeed data. The MiniSeed record is directly stored after the 8-byte data chunk header. All meta information needed by CAPS is extracted from the MiniSeed header. The advantage of this native MiniSeed support is that existing plugin and client code may be reused. Also the transfer and storage volume is minimized.

Because of the fixed record size requirement neither the trim nor the merge operation is supported.