scardac¶

Waveform archive data availability collector.

Description¶

scardac scans an SDS waveform archive, e.g., created by slarchive or scart, for available miniSEED data. It collects information about:

DataExtents – the earliest and latest times data is available for a particular channel,
DataAttributeExtents – the earliest and latest times data is available for a particular channel, quality and sampling rate combination,
DataSegments – continuous data segments sharing the same quality and sampling rate attributes.

scardac is intended to be executed periodically, e.g., as a cronjob.

The availability information is stored in the SeisComP database under the root element DataAvailability. Access to the availability data is provided by the fdsnws module via the services:

/fdsnws/station (extent information only, see matchtimeseries and includeavailability request parameters).
/fdsnws/ext/availability (extent and segment information provided in different formats)

Non-SDS archives¶

scardac can be extended by plugins to scan non-SDS archives. For example, the daccaps plugin provided by CAPS [3] allows scanning archives generated by a CAPS server. Plugins are added to the global module configuration, e.g.:

plugins = ${plugins}, daccaps

Definitions¶

Record – continuous waveform data of the same sampling rate and quality, bound by a start and end time. scardac only reads the record’s metadata and not the actual samples.
Chunk – container for records, e.g., a miniSEED file, with the following properties:
- overall, theoretical time range of records it may contain
- contains at least one record, otherwise it must be absent
- each record of a chunk must fulfill the following conditions:
  - record start < record end
  - chunk start <= record start < chunk end
  - chunk start < record end < next chunk end
- a record stored in chunk N may have an end time greater than the start time of chunk N+1, but no more than maxChunkOverlap samples should stretch into the next chunk; otherwise unnecessary reads are triggered
- chunks do not overlap, end time of current chunk equals start time of successive chunk, otherwise a chunk gap is declared
- records may occur unordered within a chunk or across chunk boundaries, resulting in DataSegments marked as outOfOrder
Jitter – maximum allowed deviation between the end time of the current record and the start time of the next record, in multiples of the current record’s sample interval. E.g., a sampling rate of 100 Hz and a jitter of 0.5 allow for a maximum end-to-start time difference of 5 ms (0.5 × 10 ms). If exceeded, a new DataSegment is created.
Mtime – time the content of a chunk was last modified. It is used to
- decide whether a chunk needs to be read in a subsequent application run
- calculate the updated time stamp of a DataSegment, DataAttributeExtent and DataExtent
Scan window – time window limiting the synchronization of the archive with the database. It is configured via filter.time.start and filter.time.end or, on the command line, via --start and --end. The restriction is enforced symmetrically on chunks, on individual segments and on existing database segments:
- chunks lying entirely outside the scan window are skipped,
- for chunks that straddle a scan-window boundary, only the segments inside the window are considered,
- existing DataSegments outside the scan window are left untouched in the database.
The scan window is useful to
- reduce the scan time of larger archives. Depending on the size and storage type of the archive it may take some time to just list available chunks and their mtime.
- prevent deletion of availability information even though parts of the archive have been deleted or moved to a different location
Modification window – the mtime of a chunk is compared with this time window to decide whether it needs to be read or not. It is configured via mtime.start and mtime.end or, on the command line, via --modified-since and --modified-until. If no lower bound is defined, the lastScan time stored in the DataExtent is used instead. The mtime check may be disabled using mtime.ignore or --deep-scan. Note: Chunks directly before or after a chunk gap are read in any case, regardless of the mtime settings.
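
The record conditions listed for a chunk can be written down as a small predicate. The following is a minimal sketch and not scardac's implementation: the `Chunk` class and both function names are hypothetical, and times are plain seconds for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    start: float  # chunk window start in seconds
    end: float    # chunk window end, equal to the start of the next chunk


def record_fits_chunk(chunk: Chunk, next_chunk_end: float,
                      rec_start: float, rec_end: float) -> bool:
    """Check the three record conditions listed in the chunk definition."""
    return (
        rec_start < rec_end
        and chunk.start <= rec_start < chunk.end
        and chunk.start < rec_end < next_chunk_end
    )


def samples_past_chunk_end(chunk: Chunk, rec_end: float,
                           sampling_rate: float) -> float:
    """Number of samples of a record that stretch past the chunk end."""
    return max(0.0, (rec_end - chunk.end) * sampling_rate)


DAY = 86400.0
chunk = Chunk(start=0.0, end=DAY)  # a hypothetical day file

# A 100 Hz record starting 2 s before and ending 3 s after the chunk end.
ok = record_fits_chunk(chunk, 2 * DAY, DAY - 2.0, DAY + 3.0)
overlap = samples_past_chunk_end(chunk, DAY + 3.0, 100.0)
print(ok, overlap, overlap <= 500)
```

In this example the record is valid for the chunk, and the 300 samples reaching into the next chunk stay below the default maxChunkOverlap of 500.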

Workflow¶

Read existing DataExtents from database.
Collect a list of available stream IDs either by
- scanning the archive for available IDs or
- reading an ID file defined by nslcFile.
Identify extents to add, update or remove respecting scan window, filter.nslc.include and filter.nslc.exclude.
Subsequently process the DataExtents in parallel, using the number of threads configured via threads. For each DataExtent:
1. Capture the current time as the prospective new lastScan value, so chunks modified during the scan are picked up on the next run.
2. Collect all available chunks of the stream from the archive and, if the extent already exists, load existing DataSegments inside the scan window from the database.
3. PLAN phase – for each chunk decide whether to READ or SKIP it based on the scan window, modification window and lastScan. A chunk is also re-read when no existing DB segment SPANS or STARTS within its window, so chunks reappearing in the archive with their original mtime are picked up even though that mtime is older than the previous scan. Afterwards, propagate READ to neighboring chunks whenever a database segment straddles their common boundary, so a boundary-spanning segment is either re-derived from fresh records on both sides or copied verbatim from the database on both sides, but never mixed.
4. BUILD phase – assemble the desired segment list:
  - for a READ chunk parse its records, derive chunk segments by analyzing gaps/overlaps with respect to jitter, sampling rate and quality changes, and drop chunk segments lying outside the scan window,
  - for a SKIP chunk copy database segments starting in the chunk’s window,
  - adjacent segments that are contiguous within jitter and share sampling rate and quality are merged across chunk boundaries.
5. DIFF phase – compare the desired segment list against the previously loaded database segments and derive the resulting insert, update and remove operations. Segments outside the scan window are never considered for removal.
6. Apply the collected operations to the database and recompute DataAttributeExtents and the overall DataExtent.
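
The PLAN phase can be illustrated with a simplified sketch. It is an interpretation of the description above, not scardac's code: the `Chunk` class, the `plan` function and its parameters are hypothetical, "spans or starts within" is read in the simplest possible way, and the role of maxChunkOverlap is ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start, end) of a segment already in the database


@dataclass
class Chunk:
    start: float        # chunk window start in seconds
    end: float          # chunk window end
    mtime: float        # last modification time of the chunk
    read: bool = False  # True = READ, False = SKIP


def plan(chunks: List[Chunk], db_segments: List[Segment], last_scan: float,
         scan_start: float = float("-inf"), scan_end: float = float("inf"),
         mtime_start: Optional[float] = None, mtime_end: Optional[float] = None,
         deep_scan: bool = False) -> List[Chunk]:
    """Mark each chunk inside the scan window as READ or SKIP (simplified)."""
    # Chunks lying entirely outside the scan window are not considered at all.
    todo = sorted((c for c in chunks if c.start < scan_end and c.end > scan_start),
                  key=lambda c: c.start)
    lower = last_scan if mtime_start is None else mtime_start
    upper = float("inf") if mtime_end is None else mtime_end

    for c in todo:
        modified = lower <= c.mtime <= upper
        # Assumed reading of "a DB segment spans or starts within the chunk window".
        known = any((s <= c.start and e >= c.end) or (c.start <= s < c.end)
                    for s, e in db_segments)
        c.read = deep_scan or modified or not known

    # Chunks directly before or after a chunk gap are always read.
    for left, right in zip(todo, todo[1:]):
        if left.end != right.start:
            left.read = right.read = True

    # Propagate READ across boundaries straddled by a database segment.
    changed = True
    while changed:
        changed = False
        for left, right in zip(todo, todo[1:]):
            if left.read != right.read and left.end == right.start and any(
                    s < right.start < e for s, e in db_segments):
                left.read = right.read = True
                changed = True
    return todo


DAY = 86400.0
chunks = [Chunk(0, DAY, mtime=10), Chunk(DAY, 2 * DAY, mtime=10),
          Chunk(2 * DAY, 3 * DAY, mtime=500)]
db = [(0.0, 0.9 * DAY), (1.6 * DAY, 2.2 * DAY)]
result = plan(chunks, db, last_scan=100)
for c in result:
    print(int(c.start // DAY), "READ" if c.read else "SKIP")
```

With these made-up inputs the third chunk is read because its mtime is newer than lastScan, the second chunk is read because a database segment straddles the boundary it shares with the third, and the first chunk is skipped.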

Examples¶

Get command line help or execute scardac with default parameters and informative debug output:
```
scardac -h
scardac --debug
```

Synchronize the availability of waveform data files existing in the standard SDS archive with the SeisComP database and create an XML file using scxmldump:

```
scardac -d mysql://sysop:sysop@localhost/seiscomp -a $SEISCOMP_ROOT/var/lib/archive --debug
scxmldump -Yf -d mysql://sysop:sysop@localhost/seiscomp -o availability.xml
```

Synchronize the availability of waveform data files existing in the standard SDS archive with the SeisComP database. Use fdsnws to fetch a flat file containing a list of periods of available data from stations of the CX network sharing the same quality and sampling rate attributes:
```
scardac -d mysql://sysop:sysop@localhost/seiscomp -a $SEISCOMP_ROOT/var/lib/archive
wget -O availability.txt 'http://localhost:8080/fdsnws/ext/availability/1/query?network=CX'
```
Note

The SeisComP module fdsnws must be running for executing this example.
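
The same request can be made from a script. The following sketch uses only the Python standard library; the helper names `availability_url` and `fetch_availability` are hypothetical, and only the `network` parameter shown in the wget example is taken from this page.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def availability_url(base: str, **params: str) -> str:
    """Build the query URL of the availability service for a given server."""
    return f"{base}/fdsnws/ext/availability/1/query?{urlencode(params)}"


def fetch_availability(base: str, **params: str) -> str:
    """Fetch the availability listing as text (requires a running fdsnws)."""
    with urlopen(availability_url(base, **params)) as response:
        return response.read().decode("utf-8")


url = availability_url("http://localhost:8080", network="CX")
print(url)
# text = fetch_availability("http://localhost:8080", network="CX")
```

The fetch call is commented out because it only succeeds when fdsnws is running on the given host and port.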

Module Configuration¶

etc/defaults/global.cfg
etc/defaults/scardac.cfg
etc/global.cfg
etc/scardac.cfg
~/.seiscomp/global.cfg
~/.seiscomp/scardac.cfg

scardac inherits global options.

archive¶

Default: @SEISCOMP_ROOT@/var/lib/archive

Type: directory

The URL to the waveform archive where all data is stored.

Format: [service://]location[#type]

"service": The type of the archive. If not given, "sds://" is implied assuming an SDS archive. The SDS archive structure is defined as YEAR/NET/STA/CHA/NET.STA.LOC.CHA.YEAR.DAYOFYEAR, e.g. 2018/GE/APE/BHZ.D/GE.APE..BHZ.D.2018.125

Other archive types may be considered by plugins.
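
To make the SDS layout concrete, the following sketch assembles the relative path of a day file. The function `sds_path` is hypothetical, and the `.D` data type suffix is taken from the example path above rather than from the pattern itself.

```python
from datetime import date


def sds_path(net: str, sta: str, loc: str, cha: str, day: date,
             data_type: str = "D") -> str:
    """Return the relative SDS path of the day file for one stream."""
    year = day.year
    doy = day.timetuple().tm_yday  # day of year, 1-based
    return (f"{year}/{net}/{sta}/{cha}.{data_type}/"
            f"{net}.{sta}.{loc}.{cha}.{data_type}.{year}.{doy:03d}")


# 5 May 2018 is day 125 of the year; an empty location code yields "..".
path = sds_path("GE", "APE", "", "BHZ", date(2018, 5, 5))
print(path)
```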

threads¶

Default: 1

Type: int

Number of threads scanning the archive in parallel.

jitter¶

Default: 0.5

Type: float

Acceptable deviation between the end time of a record and the start time of its successor, in multiples of the sample time.
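
As an illustration of how such a tolerance might be applied, the sketch below merges time-ordered records into segments. It assumes the tolerance is the jitter divided by the sampling rate (i.e., jitter times the sample time); the `Record` class and the function names are hypothetical and do not reflect scardac's internals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    start: float          # start time in seconds
    end: float            # end time in seconds
    sampling_rate: float  # in Hz
    quality: str


def continues(prev: Record, nxt: Record, jitter: float = 0.5) -> bool:
    """True if nxt continues prev within the jitter tolerance."""
    tolerance = jitter / prev.sampling_rate  # jitter in multiples of the sample time
    return (prev.sampling_rate == nxt.sampling_rate
            and prev.quality == nxt.quality
            and abs(nxt.start - prev.end) <= tolerance)


def build_segments(records: List[Record], jitter: float = 0.5) -> List[Record]:
    """Merge time-ordered records into continuous segments."""
    segments: List[Record] = []
    for rec in records:
        if segments and continues(segments[-1], rec, jitter):
            segments[-1].end = rec.end  # extend the current segment
        else:
            segments.append(Record(rec.start, rec.end, rec.sampling_rate, rec.quality))
    return segments


records = [
    Record(0.0, 10.0, 100.0, "D"),
    Record(10.002, 20.0, 100.0, "D"),  # 2 ms offset: within tolerance
    Record(20.02, 30.0, 100.0, "D"),   # 20 ms gap: starts a new segment
]
segments = build_segments(records)
print([(s.start, s.end) for s in segments])
```

The sketch treats gaps and overlaps alike by comparing the absolute time difference, and it starts a new segment whenever sampling rate or quality changes.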

maxSegments¶

Default: 1000000

Type: int

Maximum number of segments per stream. If the limit is reached no more segments are added to the database and the corresponding extent is flagged as too fragmented. Set this parameter to 0 to disable any limits.

maxChunkOverlap¶

Default: 500

Type: int

A record entirely stored in chunk N may have an end time exceeding the chunk’s time window. This parameter defines the maximum number of samples overlapping the chunk’s end time.

The parameter is used to evaluate if a chunk needs to be read in a corner case where a chunk was moved out of the archive during a previous scan (causing surrounding segments to be split at the chunk’s boundaries) and then later moved back with its original mtime. In that situation the chunk’s mtime stays older than lastScan and no READ would be triggered otherwise.

If set to values greater than the expected number of samples per record, unnecessary reads of chunks and possibly of neighbouring chunks are triggered.

nslcFile¶

Type: file

Line-based text file of form NET.STA.LOC.CHA defining available stream IDs. Depending on the archive type, size and storage media used this file may offer a significant performance improvement compared to collecting the available streams on each startup. Filters defined under filter.nslc still apply.

Note

filter.* Parameters of this section limit the data processing in order to either

- reduce the scan time of larger archives or
- prevent deletion of availability information even though parts of the archive have been deleted or moved to a different location.

Note

filter.time.* Limit the processing by record time.

filter.time.start¶

Type: string

Start of data availability check given as date string or as number of days before now.

filter.time.end¶

Type: string

End of data availability check given as date string or as number of days before now.
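
The two accepted value forms can be illustrated as follows. This is a sketch under assumptions: the exact date string format accepted by scardac is not specified here, so ISO 8601 dates are assumed, and the function `parse_time_filter` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_time_filter(value: str, now: datetime) -> datetime:
    """Interpret value as days before now or, failing that, as a date string."""
    try:
        return now - timedelta(days=float(value))
    except ValueError:
        # Assumption: an ISO 8601 date such as 2023-12-24, interpreted as UTC.
        return datetime.fromisoformat(value).replace(tzinfo=timezone.utc)


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
print(parse_time_filter("30", now))          # 30 days before now
print(parse_time_filter("2023-12-24", now))  # explicit date
```

Passing the current time explicitly keeps the function deterministic, which is convenient for testing.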

Note

filter.nslc.* Limit the processing by stream IDs.

filter.nslc.include¶

Type: list:string

Comma-separated list of stream IDs to process. If empty all streams are accepted unless an exclude filter is defined. The following wildcards are supported: ‘*’ and ‘?’.

filter.nslc.exclude¶

Type: list:string

Comma-separated list of stream IDs to exclude from processing. Excludes take precedence over includes. The following wildcards are supported: ‘*’ and ‘?’.
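
The interplay of both filters might look like the following sketch. It is not scardac's code: the function names are hypothetical, and only the two documented wildcards are translated, with every other character matched literally.

```python
import re
from typing import List


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern using only '*' and '?' into a regular expression."""
    parts = []
    for ch in pattern:
        parts.append(".*" if ch == "*" else "." if ch == "?" else re.escape(ch))
    return re.compile("".join(parts))


def accept(stream_id: str, include: List[str], exclude: List[str]) -> bool:
    """Apply the exclude list first, then the include list."""
    if any(wildcard_to_regex(p).fullmatch(stream_id) for p in exclude):
        return False  # excludes take precedence over includes
    # An empty include list accepts every stream that was not excluded.
    return not include or any(
        wildcard_to_regex(p).fullmatch(stream_id) for p in include)


for sid in ["GE.APE..BHZ", "GE.APE..LHZ", "CX.PB01..HHZ"]:
    print(sid, accept(sid, include=["GE.*"], exclude=["*.LHZ"]))
```

With these example patterns only GE.APE..BHZ is accepted: GE.APE..LHZ matches the exclude pattern and CX.PB01..HHZ matches no include pattern.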

Note

mtime.* Parameters of this section control the rescan of data chunks. By default the last update time of the extent is compared with the record file modification time to read only files modified since the last run.

mtime.ignore¶

Default: false

Type: boolean

If set to true, all data chunks are read independent of their mtime.

mtime.start¶

Type: string

Only read chunks modified after a specific date given as date string or as number of days before now.

mtime.end¶

Type: string

Only read chunks modified before a specific date given as date string or as number of days before now.

Command-Line Options¶

scardac [OPTION]...

Generic¶

-h, --help¶: Show help message.

-V, --version¶: Show version information.

--config-file file¶: The alternative module configuration file. When this option is used, the module configuration is only read from the given file and no other configuration stage is considered. Therefore, all configuration including the definition of plugins must be contained in that file or given along with other command-line options such as --plugins.

--plugins arg¶: Load given plugins.

Verbosity¶

--verbosity arg¶: Verbosity level [0..4]. 0:quiet, 1:error, 2:warning, 3:info, 4:debug.

-v, --v¶: Increase verbosity level (may be repeated, e.g., -vv).

-q, --quiet¶: Quiet mode: no logging output.

--print-component arg¶: For each log entry print the component right after the log level. By default the component output is enabled for file output but disabled for console output.

--component arg¶: Limit the logging to a certain component. This option can be given more than once.

-s, --syslog¶: Use syslog logging backend. The output usually goes to /var/log/messages.

-l, --lockfile arg¶: Path to lock file.

--console arg¶: Send log output to stdout.

--debug¶: Execute in debug mode. Equivalent to --verbosity=4 --console=1.

--trace¶: Execute in trace mode. Equivalent to --verbosity=4 --console=1 --print-component=1 --print-context=1.

--log-file arg¶: Use alternative log file.

Collector¶

-a, --archive arg¶: Overrides configuration parameter archive.

--threads arg¶: Overrides configuration parameter threads.

-j, --jitter arg¶: Overrides configuration parameter jitter.

--nslc arg¶: Overrides configuration parameter nslcFile.

--start arg¶: Overrides configuration parameter filter.time.start.

--end arg¶: Overrides configuration parameter filter.time.end.

--include arg¶: Overrides configuration parameter filter.nslc.include.

--exclude arg¶: Overrides configuration parameter filter.nslc.exclude.

--deep-scan¶: Overrides configuration parameter mtime.ignore.

--modified-since arg¶: Overrides configuration parameter mtime.start.

--modified-until arg¶: Overrides configuration parameter mtime.end.

--generate-test-data arg¶: Do not scan the archive but generate test data for each stream in the inventory. Format: days,gaps,gapsLen,overlaps,overlapLen. E.g., the following parameter list would generate test data for 100 days (starting from now()-100 days) which includes 150 gaps with a length of 2.5 s followed by 50 overlaps with an overlap of 5 s: --generate-test-data=100,150,2.5,50,5
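
The argument format can be illustrated by a small parser. This is a sketch only: the `GeneratorSpec` class and `parse_generator_spec` function are hypothetical, and the choice of integer versus floating-point fields is an assumption based on the example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratorSpec:
    days: int           # number of days, counted back from now
    gaps: int           # number of gaps
    gap_len: float      # gap length in seconds
    overlaps: int       # number of overlaps
    overlap_len: float  # overlap length in seconds


def parse_generator_spec(arg: str) -> GeneratorSpec:
    """Parse 'days,gaps,gapsLen,overlaps,overlapLen' into a GeneratorSpec."""
    fields = arg.split(",")
    if len(fields) != 5:
        raise ValueError("expected days,gaps,gapsLen,overlaps,overlapLen")
    days, gaps, gap_len, overlaps, overlap_len = fields
    return GeneratorSpec(int(days), int(gaps), float(gap_len),
                         int(overlaps), float(overlap_len))


spec = parse_generator_spec("100,150,2.5,50,5")
print(spec)
```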
