Section 6.4: Harvesters

If you are a site administrator, you can create entry harvesters that either scan the server file system or fetch web-based resources and automatically ingest what they find into the repository. There is a simple introduction to the File Harvester here.

Go to the Harvesters tab under the Admin page. This lists all of the Harvesters that have been created and allows you to create new ones. Enter a Name for a new Harvester, select a type (e.g., "Local Files" for a File Harvester) and press Create. You will be taken to the Edit form for the new Harvester. From the main Harvesters tab you can always bring up this Edit form for existing Harvesters.

The File Harvester lets you specify a directory root on the server file system and a regular expression pattern that is used to match file names. Any files that are found are then added to the repository. The repository is checked so as to not add duplicate files.

6.4.0 Run Settings

The form has settings for running the harvester. When you are first creating a harvester, it may take some time to figure out just what you are harvesting and what the name and folder settings for the repository entries should be. So, it's good to turn on test mode. In test mode, no entries are added to the repository when you run the harvester. Instead, up to "Count" files will be found and the results will be listed in the "Information" section of the harvester page.

The "Active on startup" flag, when set, results in the harvester being started when the repository starts up. The "Run continually" flag makes the harvester run repeatedly, using the "Every" setting to determine the pause between runs. You can choose "Absolute" to pause N minutes between runs. Or, you can choose "Minutes" or "Hourly" to have it run relative to the hour or the day, e.g., "3 hourly" will run at 0Z, 3Z, 6Z, 9Z, etc.

For example, if you know that data files are arriving in real time every 30 minutes, you could set your harvester to run in "Absolute" mode every 15 minutes. If you had a Web harvester that is fetching images, you might want to use an "Hourly" setting to get the image at some fixed interval (e.g., 0Z, 6Z, 12Z, 18Z, etc.).
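The difference between the "Absolute" and the clock-relative settings can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the harvester's actual code: the function name next_run and the lowercase mode strings are hypothetical, and it assumes that "Minutes" aligns runs to multiples of N minutes past the hour and "Hourly" aligns them to multiples of N hours past 0Z.

```python
from datetime import datetime, timedelta, timezone


def next_run(now, mode, every):
    """Return the time of the next run after `now` (hypothetical helper).

    "absolute": simply pause `every` minutes from now.
    "minutes":  run on multiples of `every` minutes past the hour.
    "hourly":   run on multiples of `every` hours past 0Z.
    """
    if mode == "absolute":
        return now + timedelta(minutes=every)
    if mode == "minutes":
        base = now.replace(minute=0, second=0, microsecond=0)
        step = timedelta(minutes=every)
    elif mode == "hourly":
        base = now.replace(hour=0, minute=0, second=0, microsecond=0)
        step = timedelta(hours=every)
    else:
        raise ValueError("unknown mode: " + mode)
    # Number of whole steps already elapsed, plus one, gives the next slot.
    return base + ((now - base) // step + 1) * step


now = datetime(2009, 1, 5, 4, 10, tzinfo=timezone.utc)
print(next_run(now, "absolute", 15))  # 15 minutes after now
print(next_run(now, "hourly", 3))     # next multiple of 3 hours past 0Z
```

With "Absolute" the run times drift with whenever the harvester was started, while the clock-relative modes always land on the same times of day.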

6.4.1 Files Settings

Under "Look for files" you specify a directory on the server file system to scan and a regular expression to match against the file name. The repository will recursively scan the directory tree and add any files that match the pattern.
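As a rough sketch of what such a scan involves, the following Python function walks a directory tree, tests each full path against a pattern, skips files that are already known, and stops after an optional count (as in test mode). The names find_matches, count and known are hypothetical. It also assumes that the pattern may match anywhere in the path and that duplicates are detected by path; the harvester's real duplicate check is not described here.

```python
import os
import re


def find_matches(root, pattern, count=None, known=None):
    """Recursively scan `root` and return the paths that match `pattern`.

    count: optional upper limit on the number of results (as in test mode).
    known: optional set of paths already ingested; these are skipped.
    """
    regex = re.compile(pattern)
    known = known or set()
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order predictable
        for name in sorted(filenames):
            # Match against the full path, written with forward slashes.
            path = os.path.join(dirpath, name).replace(os.sep, "/")
            if path in known or not regex.search(path):
                continue
            found.append(path)
            if count is not None and len(found) >= count:
                return found
    return found


# Example call (the directory is illustrative only):
# find_matches("/data/idd", r".*\.nc", count=10)
```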

The regular expressions used are somewhat extended in that you can name sub-expressions of the regular expression and use the matched text for metadata and other information when creating the entry in the repository. For example, a very common case is to have a date/time embedded in the file name. So, you could have in your regular expression something of the form:

.*data_(fromdate:\d\d\d\d\d\d\d\d_\d\d\d\d)\.nc

This would match any files of the form:

data_yyyymmdd_hhmm.nc

The "(" and ")" define the sub-expression, just as in a normal regular expression. The "fromdate:" prefix is the special extension that tells the harvester to use the text matched by that sub-expression for the fromdate field of the repository entry.

The date format that is used is defined in the Date Format field and follows the Java date format conventions.
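To make the mechanism concrete, here is a minimal Python sketch that mimics the extended syntax by rewriting "(name:expr)" into a Python named group and then parsing the captured text as a date. It is not the harvester's implementation: the helper names are hypothetical, the example pattern includes the underscore after "data" so that it matches file names of the form data_yyyymmdd_hhmm.nc, and the file path is made up. The Python format "%Y%m%d_%H%M" corresponds to the Java date format yyyyMMdd_HHmm that you would enter in the Date Format field.

```python
import re
from datetime import datetime

# Matches the start of an extended group such as "(fromdate:".
# Limitation of this sketch: an ordinary group whose content begins with
# "word:" would be rewritten too.
_FIELD = re.compile(r"\((\w+):")


def to_named_groups(pattern):
    """Rewrite '(name:expr)' as Python's '(?P<name>expr)'."""
    return _FIELD.sub(r"(?P<\1>", pattern)


def extract_fields(pattern, path):
    """Return a dict of named sub-expression matches, or None if no match."""
    match = re.search(to_named_groups(pattern), path)
    return match.groupdict() if match else None


pattern = r".*data_(fromdate:\d\d\d\d\d\d\d\d_\d\d\d\d)\.nc"
fields = extract_fields(pattern, "/data/idd/data_20090115_1230.nc")
fromdate = datetime.strptime(fields["fromdate"], "%Y%m%d_%H%M")
print(fields, fromdate)
```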

If you are creating entries of a certain type that has a number of attributes you can extract the attribute values using this extended regular expression technique. For example, if you had an entry with two attributes attr1 and attr2 and your files were of the format:

<attr1>_<attr2>.csv

Your regular expression would be:

(attr1:[^/]+)_(attr2:[^/]*)\.csv

This says that attr1 is one or more characters other than the slash ("/"). The slash is excluded so that the directory part of the path is not captured, since the full file path is used when matching patterns. The value for attr2 follows the "_" and is any number of characters other than a slash.
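The effect of the slash exclusion is easy to see in a small experiment. The sketch below writes the two sub-expressions as ordinary Python named groups and compares the result with a variant that uses "." instead of "[^/]". The file path is made up, and the use of a search over the full path is an assumption about how the pattern is applied.

```python
import re

PATH = "/data/my_files/temp_surface.csv"

# attr1 and attr2 may not contain a slash, so they stay inside the file name.
with_exclusion = re.compile(r"(?P<attr1>[^/]+)_(?P<attr2>[^/]*)\.csv")

# Without the exclusion, the greedy attr1 swallows the directory part as well.
without_exclusion = re.compile(r"(?P<attr1>.+)_(?P<attr2>.*)\.csv")

good = with_exclusion.search(PATH).groupdict()
bad = without_exclusion.search(PATH).groupdict()

print(good)  # {'attr1': 'temp', 'attr2': 'surface'}
print(bad)   # {'attr1': '/data/my_files/temp', 'attr2': 'surface'}
```

Note that because the first sub-expression is greedy, a file name with several underscores such as a_b_c.csv yields attr1 = "a_b" and attr2 = "c".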

6.4.2 Entry Creation

When creating an entry, the harvester needs to know the folder to put it under, its name and its description. You specify templates for these that can contain a set of macros (see below). Note: this is where the test mode described above is useful. Sometimes it takes a while to figure out just what you want in terms of folder structure and entry names.

To define the folder you need to select an existing base folder and then optionally specify a folder template. The folder template is used to automatically create a new folder if needed. So for example, if your base folder was: Top/Data and your Folder Template was: Ingested/Satellite then the result folder would be:

Top/Data/Ingested/Satellite

The Harvester would create the Ingested and the Satellite folders as needed.
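The "create as needed" behavior can be illustrated with a toy folder tree held in nested dictionaries. This is only a sketch: ensure_folder is a hypothetical helper, and the repository's real folder storage is not a dictionary.

```python
def ensure_folder(tree, base, template):
    """Resolve base + template in `tree`, creating missing template folders.

    Returns the full folder path and the list of folders that were created.
    The base folder must already exist (a KeyError is raised otherwise).
    """
    node = tree
    for part in base.split("/"):
        node = node[part]
    path = base
    created = []
    for part in filter(None, template.split("/")):
        path = path + "/" + part
        if part not in node:
            node[part] = {}
            created.append(path)
        node = node[part]
    return path, created


tree = {"Top": {"Data": {}}}
print(ensure_folder(tree, "Top/Data", "Ingested/Satellite"))
```

Running the same call a second time returns the same path with an empty list of created folders, since both folders now exist.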

The name, description and folder templates can all contain the following macros. Note: The different date prefixes (create_, from_ and to_) refer to the create date/time, the from date/time of the data (which defaults to the create date unless specified in the pattern) and the to date/time of the data.

${filename}	The file name (not the full path)
${fileextension}	The file extension
${dirgroup}	See below
${create_date}, ${from_date}, ${to_date}	The full formatted date string
${create_day}, ${from_day}, ${to_day}	The numeric day of the month
${create_week}, ${from_week}, ${to_week}	The numeric week of the month
${create_weekofyear}, ${from_weekofyear}, ${to_weekofyear}	The numeric week of the year
${create_month}, ${from_month}, ${to_month}	The numeric month of the year
${create_monthname}, ${from_monthname}, ${to_monthname}	The month name
${create_year}, ${from_year}, ${to_year}	The numeric year

The dirgroup macro holds the parent directories of the data file up to, but not including, the main directory path being searched. For example, if you are looking under a directory called "/data/idd" and that directory tree held the files:

/data/idd/dir1/data1.nc
/data/idd/dir1/dir2/data2.nc

Then when ingesting the data1.nc file its dirgroup value would be:

dir1

When ingesting the data2.nc file its dirgroup value would be:

dir1/dir2
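In other words, dirgroup is the file's parent directory expressed relative to the search root. A minimal Python sketch follows; the function name is hypothetical, and returning an empty string for files that sit directly in the root is an assumption.

```python
import posixpath


def dirgroup(root, file_path):
    """Return the parent directories of file_path relative to root."""
    parent = posixpath.dirname(file_path)
    relative = posixpath.relpath(parent, root)
    # relpath gives "." when the file sits directly in the root directory.
    return "" if relative == "." else relative


print(dirgroup("/data/idd", "/data/idd/dir1/data1.nc"))       # dir1
print(dirgroup("/data/idd", "/data/idd/dir1/dir2/data2.nc"))  # dir1/dir2
```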

Another common way of defining the folder is to use the date macros. For example a folder template of the form:

${from_year}/${from_monthname}/Week ${from_week}

Would result in folders like:

2009/January/Week 1
2009/January/Week 2
...
2009/March/Week 1
2009/March/Week 2

You can also name the entries using the macros. So, using the above date-based folder template, you could then have a Name template that incorporates the formatted date:

Gridded data - ${from_date}
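The following Python sketch shows how such templates could be expanded. It is illustrative only. The helper names are hypothetical, only a few of the date macros are included, the week of the month is assumed to be counted in blocks of seven days starting on the 1st, and the layout of the formatted date string is an assumption (in the harvester it depends on the configured format).

```python
import re
from datetime import datetime

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]


def date_macros(prefix, when):
    """Build a subset of the date macros for one prefix (create, from or to)."""
    return {
        prefix + "_date": when.strftime("%Y-%m-%d %H:%M"),  # assumed layout
        prefix + "_day": str(when.day),
        prefix + "_week": str((when.day - 1) // 7 + 1),  # assumed definition
        prefix + "_month": str(when.month),
        prefix + "_monthname": MONTH_NAMES[when.month - 1],
        prefix + "_year": str(when.year),
    }


def expand(template, macros):
    """Replace each ${name} with its value; unknown macros are left as-is."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: macros.get(m.group(1), m.group(0)),
                  template)


macros = date_macros("from", datetime(2009, 1, 5, 12, 30))
print(expand("${from_year}/${from_monthname}/Week ${from_week}", macros))
print(expand("Gridded data - ${from_date}", macros))
```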

The "Move file to storage" checkbox determines whether the file is moved from its initial location to the RAMADDA storage area.
Note: If the file is not moved to the storage area, then one of the directories the file lies under needs to be added to the list of file system directories in the Admin->Access area.

6.4.3 Web Harvesters

The Web Harvesters work the same way as the File Harvesters, but they fetch a URL (e.g., an image) every time they run. You can also define more than one URL to fetch. The basic Run settings, folder and entry creation mechanisms are the same as described above.