HDF-EOS WORKSHOP II EXPERTS PANEL
September 23, 1998

 

[Note: some extraneous "banter" has been removed from this transcript.]

Experts Panel:

Candace Carlisle, ESDIS (Moderator)
Dan Marinelli, ESDIS Science Systems Development Office
Mike Folk, NCSA HDF Development
Doug Ilg, RSTX HDF-EOS Consulting
Siri Jodha Khalsa, ECS/SAC Metadata
Larry Klein, ECS/SAC HDF-EOS Development
Dave Winn, ECS/SAC HDF-EOS Development
Tim Gubbels, ECS Science Office

Listing of Questions

Question 1: Source Code for GCTP Routines
Question 2: HDF-EOS File Extension
Question 3: Naming Geolocation Fields
Question 4: Missing Data Values
Question 5: TAI
Question 6: Flushing the Buffer During Malfunctions
Question 7: Impact of PI Processing
Question 8: Internet Access of EOS Data Sets
Question 9: Impact of HDF5 on HDF-EOS
Question 10: Variable Length Records
Question 11: Filters/Conversion Packages
Question 12: User-Developed Software
Question 13: Large Objects
Question 14: Getting Data into an HDF Granule
Question 15: Metadata Tools

Candace Carlisle: Now this session is being taped, so fair warning to all. We are planning to get a transcript of this session and post it on the Web, so if you do ask questions, you need to use your mike. We did this last year, and I found you don't sound nearly as profound once you're on the Web as you thought you did when you were talking, but we're going to try it anyway, because I think it was useful to let people know the information. Oh yes, we'd also like you to announce your name when you ask your question, so that it's recorded for all time. Our format here today is that we've had some questions submitted in advance; we went through them at lunch, debated them at great length, and assigned people to answer them. We're going to answer those questions first, and then if there's time left over we'll entertain questions from the audience. So, just going in a random order here.


—Question 1: Source Code for GCTP Routines—

Carlisle: The first one's a suggestion, and I got this from Ted Meyer from Fortner, and it says, "Make the source code for the GCTP routines available in source so that vendors can support other platforms in binary libraries; i.e., Mac OS, Dynamic Link Libraries," and Larry Klein is going to address that.

 

Larry Klein: As a historical artifact, we only distribute the binaries for the projection library with HDF-EOS, but it comes from the same source that's distributed with the toolkit. So if anybody wants the source they can just download the SDP toolkit. Does that answer your question, Ted? You probably already have it.

—Question 2: HDF-EOS File Extension—

Carlisle: OK, let's move on. The next one I have is another suggestion that I wrote down that Andy Pursch—did I say that right?—from RSI made, and he suggested that we use a file extension of, for example, .eos, on HDF-EOS files, and Doug Ilg is going to address that.

 

Doug Ilg: I think we'd rather not define any particular extension to identify HDF-EOS files because, as I said in my presentation earlier, it doesn't really make sense to ask whether a file is an HDF-EOS file or not; it's individual objects within the file that are HDF-EOS or not. So I think what we need to do is open all these files as HDF files, and there's also no standard extension for HDF files. People commonly use .hdf, but it's not required. So what you need to do is open them up, and then you can find out from the contents of the file whether there are HDF-EOS objects in there. You could use the routines (I forget their names; Steve Eddins was talking about them in the Matlab demonstration) that will list the number of grids, points, or swaths in the file. If there are zero, then you don't have to deal with them.
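
[Editor's note: as an illustration of the approach Doug describes, here is a minimal C sketch using the standard HDF-EOS inquiry routines (GDinqgrid, SWinqswath, PTinqpoint). The file name is hypothetical and error handling is omitted.]

```c
/* Sketch: decide whether a file contains HDF-EOS objects by asking the
   library how many grids, swaths, and points it holds.  Assumes a
   standard HDF-EOS 2.x installation; error checking omitted. */
#include <stdio.h>
#include "HdfEosDef.h"

int contains_hdfeos_objects(char *filename)
{
    int32 strbufsize;   /* size needed to hold the object-name list */

    /* Passing NULL for the name list just returns the object counts. */
    int32 ngrid  = GDinqgrid(filename, NULL, &strbufsize);
    int32 nswath = SWinqswath(filename, NULL, &strbufsize);
    int32 npoint = PTinqpoint(filename, NULL, &strbufsize);

    return (ngrid > 0) || (nswath > 0) || (npoint > 0);
}

int main(void)
{
    char file[] = "example.hdf";   /* hypothetical file name */
    printf("%s %s HDF-EOS objects\n", file,
           contains_hdfeos_objects(file) ? "contains" : "has no");
    return 0;
}
```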

Audience member: Just a quick question. Would you feel the same way about assigning a MIME type to it? Should it just be the standard HDF MIME type?

Ilg: I think so, yes.

Carlisle: Any other discussion on this one? OK, let's move on.

 

—Question 3: Naming Geolocation Fields—

Carlisle: The next question is from Linda Hunt, at NASA Langley DAAC, and it's a long question. "Is there any hope," she says, "is there any hope for having more flexibility in naming geolocation fields? Currently, in order to be subsettable, the only geolocation field names are latitude or colatitude, longitude, and time. But there are occasions when there are multiple lat/long, or time variables associated with, for example, a swath, and it may be desirable to subset by any of them. It would be desirable to either 1) be able to specify in the subset call defining the region or time the names of the fields to be used, or 2) have some way to map the actual field name to be used to a required field name, for example, make latitude a reference to the desired latitude geolocation field." And Dan Marinelli and Larry Klein are going to answer this one.

Klein: Number one is going to be pretty hard. Trying to do subsetting on a general field that we don't know how to specify would take a lot of work. Having these few names enabled us to focus the problem enough that we could guarantee generalized subsetting; trying to do it on just an arbitrary field would be pretty hard.

Linda Hunt: I actually wasn't referring to any arbitrary field, but say, a latitude field where the values are latitude values, but there is more than one latitude. For example, the infamous CERES has latitudes, or colatitudes in their case, at the top of the atmosphere for the parameters they measure there, and ground-based for the surface parameters. So there would be occasions when you'd want to subset by the top-of-the-atmosphere mapped latitudes or the ground-based ones, and they're all latitudes, or colatitudes, 0 to 180 in their case. So it's not like it's an arbitrary field with arbitrary values.

Klein: OK, so it's only geolocation fields, then—that's more meaningful, I can do that. Actually implementing it...

Dan Marinelli (?): Yeah, number two is feasible, actually, and it could be put on our list of things to do, but we couldn't make any promises on when it would get accomplished.

Hunt: Yeah, because that way, if you could map, say I want latitude at the top of the atmosphere to be the latitude parameter for subsetting, and just map that to it, that would satisfy the ability to use the tools.

Marinelli (?): Right. So there is hope.

Hunt: There is hope.
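
[Editor's note: for context, the fixed geolocation names the subsetting code currently keys on ("Latitude"/"Colatitude", "Longitude", "Time") are assigned when a swath's geolocation fields are defined. A minimal C sketch follows; the swath ID, dimension names, and sizes are hypothetical.]

```c
/* Sketch: defining the standard geolocation fields that the generic
   subsetting tools currently recognize.  An alternate latitude (e.g. a
   top-of-atmosphere colatitude) stored under another name is not seen
   by those tools, which is what the proposed mapping would address. */
#include "HdfEosDef.h"

void define_geofields(int32 swathID)
{
    /* Hypothetical swath dimensions. */
    SWdefdim(swathID, "Track", 1000);
    SWdefdim(swathID, "Xtrack", 90);

    SWdefgeofield(swathID, "Latitude",  "Track,Xtrack", DFNT_FLOAT32, HDFE_NOMERGE);
    SWdefgeofield(swathID, "Longitude", "Track,Xtrack", DFNT_FLOAT32, HDFE_NOMERGE);
    SWdefgeofield(swathID, "Time",      "Track",        DFNT_FLOAT64, HDFE_NOMERGE);
}
```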

Carlisle: OK, is there any more discussion on that question?

 

—Question 4: Missing Data Values—

Carlisle: OK, the next set of questions is from Cheryl Craig from the MOPITT team. The first one: "Is there a suggested way of storing missing data values yet? For example, attributes called missing xxx." And Larry Klein was going to answer that one.

Klein: Yeah, there are a lot of ways to do that. We thought about this before, and we started stumbling, because there are lots of ways to establish missing values. If we say what a routine should be, then we would be establishing a standard for doing that, and we don't want to do that. If there were some sort of community consensus for how the community would like it done, then we could pull down a routine. I guess you guys came up with a way to do it, but we don't know that that's acceptable to everybody. We don't want to be in a position of establishing a standard until we know... Yep.

Cheryl Craig: I guess the point was that for tools out there to recognize a missing value and not plot it would be very useful. I know EOSview now allows you to filter it out if it's at the edge of your range, but if it happens to be in the middle of your range, it will still plot your missing value. If you pick zero, and you've got data on both sides, you're still going to be plotting the zero points. My thought was that there could be an attribute, or set of attributes, in an HDF-EOS file. I believe when we first pursued this years ago, Doug Ilg had sent out an e-mail suggesting "missing_" whatever, for no data or illegal or whatever, and then you just have a list of these things with set values, and end users know to look for attributes that start with "missing" and those are the values they need to throw out. If this were adopted across the board, end users would know immediately. If they have to go back to documentation to find out what the missing values are, then you're not taking full advantage of HDF and HDF-EOS being self-describing.

Lee Elson: Yeah, just to follow up on that a little bit. Is it possible to suggest a default attribute rather than a default missing data value? Is that what's being talked about here? I think that's a very good idea. Also, I think it's worthwhile to make a distinction between missing data and no data. Because, for example, an oceanography remote sensing instrument would never even consider something over the land as being missing data, but there may be places over the ocean where there are, the instrument didn't take data. Some people like to make distinctions in, between those two.

Klein: That can be handled on its own. There is no standard ??? (inaudible).

Ilg: Actually, Lee, what you've hit on there is exactly the reason we didn't want to set a particular set of attribute names: every discipline has its own ways of describing missing or no data. I've heard of at least 10 or 12 different ones just in the discussions I've had. So that's why I suggested to Cheryl maybe just "missing", as sort of an informal thing. If you just put "missing", underscore, and then describe the reason it's missing: "missing_no_data", "missing_data_drop", something like that. At least that would be a human-understandable way of doing it, not necessarily a systematic way of doing it.

 Khalsa: This is related to an issue that came up earlier. If we define an attribute called "missing," and we include that attribute as a grid attribute, that attribute would then be applicable to all the fields in that grid, which may not be appropriate. So then we would also have to add the field name to the missing value. It becomes a conventions issue, so there is not a simple answer.
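
[Editor's note: purely as an illustration of the informal convention being discussed, not an adopted standard, a field-qualified missing-value attribute could be written to a grid roughly like this; the attribute name, field name, and value are invented.]

```c
/* Sketch: record a missing-data value as a grid attribute whose name
   carries both the "missing_" prefix and the field it applies to,
   since a plain grid attribute would otherwise apply to every field. */
#include "HdfEosDef.h"

void record_missing_value(int32 gridID)
{
    float32 missing = -9999.0f;   /* illustrative value */
    GDwriteattr(gridID, "missing_no_data_Temperature",
                DFNT_FLOAT32, 1, &missing);
}
```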

Carlisle: We did also discuss at lunch, I think, that if somebody wanted to write up a recommended approach to this, we'd be happy to sponsor it and publish it. We have a recommended approach to how to lay out your data sets in such a way that subsetting is easier, though plenty of people have chosen not to follow it. We wouldn't mind having a recommended approach to missing data values that, if people chose to follow it, might allow them to get some more functionality out of some of these tools. We think that maybe CERES and MOPITT have taken different approaches to this, and maybe some other instrument teams have taken yet different approaches. But it might be something the User Services Working Group would want to work on, as an idea.

Ted Meyer: This is Ted, by the way. Can the project take on the responsibility to request that two or three of the instrument teams, or more, get together with maybe Doug to produce a recommended approach, and then produce a paper or document that describes it?

Pedro Vicente: My name is Pedro Vicente. I'm not sure if I understood that right, about the default value for the missing value. I have some experience in that area, because where I used to work we worked a lot with those fill values too. Well, I repeat, I'm not sure if I understood right, but I think it's not a good idea to have a default fill value, because that takes away freedom from users who want to implement their own fill values. For example, at the place where I work we implement –99 for land values, and other places can have other different values. So anyway, I think it's not a good idea to have a default fill value.

Khalsa: Going back to the data model, you can define quality flags in a PSA. There's no reason you couldn't define a product-specific attribute that says missing value, with whatever qualifier you want on it, have it there in your metadata, and tell your users: if you want to know what the fill value is, read this particular metadata attribute. You're talking about a standardized place so that all tools can recognize it, and that's probably hard to achieve. I think the rest of your solution, of defining a valid range, is a good one, because you shouldn't have a missing value that's in the middle of your data range. So you can have however many numeric values for different missing conditions outside the valid range, and then just tell your users what they are. I think that's a reasonable solution.

Audience member (Peter Cornillion?): Yeah, I guess I don't quite understand why there's a problem in having an attribute that says the missing value is whatever. I mean, you have attributes for a million other things, right? And this is one of the problems we had in our data system: every data set we looked at used a different convention for a missing value, so in the little ancillary files I was talking about, we encoded a field that was the missing value, and then we said what the missing values were. Now, admittedly they mean different things, but it was a starting point. You could have several missing values. I don't quite understand what the problem is.

Panel member: Well, one of them Ted did allude to, and that is that you can attach an attribute only to a whole grid, and different layers in that grid may require different values. And then you also have the problem of what your representation of the data is. If it's integers, say an 8-byte integer, people sometimes choose to scale their data to maximize the dynamic range of the word type they're using. So it's very difficult, well, obviously it's impossible, to make it universal. And in the HDF case, you can't assign it to the individual fields in an array using the HDF attribute for annotation. Obviously you could in the metadata; there's no problem doing that. And if we agree upon a name and a standard, then we can recommend that everybody use it.

Audience member (Peter Cornillion?): That's what you're doing for a lot of other attributes, right?

Panel member: Right. I know this discussion did take place, also in the context of the data model, of there being a missing value. It was rejected, and I don't have the main data modeler here to ask (he's no longer working on this program) what was behind that decision, but—

Carlisle: OK, we will take an action to go and investigate this some more, see what approaches people are using, and see whether there's anything we can do; it sounds like maybe there are ways to do this without additional implementation. So we'll go see what approaches people are using and see if there's something we can put out as a recommended approach. We'll take that action. Is that all the discussion on this item? Matt.

Matt: This is not directly related to missing values, but back to fill values. Is there a possibility of having multiple fill values?

Ilg: Well, in HDF-EOS we use the HDF fill value. So...

Matt: And that's the same?

Mike Folk??: No, it isn't quite the same. There is only one official fill value permitted per SDS in HDF. It's one of the few sort of hard-wired attributes. Our thinking there was similar to what's going on with this missing value thing: if more than one fill value was needed, then people could define those as attributes, using some convention such as "fillvalue_so_and_so". And while I have the microphone, I agree with Peter; you're not going to have perfection, you're not going to satisfy everybody, but it should be OK to put missing values in there. Sometimes they'll work, and maybe there'll be a few times when they won't, but at least there will be a clue there that the tools can use to try to accommodate them.
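
[Editor's note: a minimal HDF 4 sketch of what Mike describes, with one official fill value per SDS and any additional ones recorded as conventionally named attributes; the second attribute name and both values are illustrative.]

```c
/* Sketch: the single hard-wired fill value, plus an extra, informally
   named fill attribute for a second condition.  The SDS id is assumed
   to come from SDcreate or SDselect elsewhere. */
#include "mfhdf.h"

void set_fill_values(int32 sds_id)
{
    float32 fill = -9999.0f;          /* the one official fill value */
    SDsetfillvalue(sds_id, &fill);

    float32 data_drop = -8888.0f;     /* hypothetical second condition */
    SDsetattr(sds_id, "fillvalue_data_drop", DFNT_FLOAT32, 1, &data_drop);
}
```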

Ravi Sharma: If we are measuring, Ravi Sharma, if we are measuring multiple parameters, let's say multispectral sensing, and if one of the values is missing, how that will be handled?

Ilg: Maybe you would fill the whole thing with a fill value, or with a missing data flag, assuming you had a convention for how to put a missing data flag in there, "missing_" or something like that. Or you could simply leave it out, just not define that field for that particular data granule. Nobody else seems to want to answer.

Carlisle: OK, are we ready to move on to the next question? I still have a bunch here.

 

—Question 5: TAI—

Carlisle: OK, the next one is: "Is TAI still the mandatory time unit for HDF-EOS files?" And Doug is going to address it.

Ilg: I don't know if mandatory is the right word. It's certainly recommended, in that ECS will only understand time values that are in TAI 93, which is, for those of you who don't know, a double-precision floating-point value representing the fractional number of seconds since January 1, 1993. You can use any unit you want, as Linda actually pointed out to me, reminded me, in a recent e-mail message. It doesn't matter what you use for time, as long as it's something that's monotonically increasing or decreasing, so it actually acts like time does; the subsetting will still work properly. But if you want any sense to be made out of those values, you're going to have to use a format that ECS understands, and TAI 93 is the only one that ECS understands in a native mode. But with the metadata and time tools that Abe Taaheri spoke about this morning, you do have the ability to convert back and forth between different time formats. So using TAI 93 isn't as difficult as it possibly was at one point.
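
[Editor's note: a rough illustration of the TAI 93 epoch only. The SDP Toolkit time routines (e.g., PGS_TD_UTCtoTAI) are the proper way to produce these values because they account for leap seconds; this sketch deliberately ignores them.]

```c
/* Sketch: approximate seconds since 1993-01-01T00:00:00 UTC for a given
   calendar date, ignoring leap seconds (true TAI 93 includes them). */
#include <stdio.h>

/* Days from 1970-01-01 to y-m-d in the proleptic Gregorian calendar. */
static long days_from_civil(int y, int m, int d)
{
    y -= m <= 2;
    long era = (y >= 0 ? y : y - 399) / 400;
    long yoe = y - era * 400;
    long doy = (153L * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;
    long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

int main(void)
{
    long epoch93 = days_from_civil(1993, 1, 1);
    long day     = days_from_civil(1998, 9, 23);      /* workshop date */
    double tai93_approx = (day - epoch93) * 86400.0;  /* 00:00 UTC */
    printf("approx TAI93 for 1998-09-23T00:00Z: %.0f s\n", tai93_approx);
    return 0;
}
```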

Peter Cornillion: This is Peter Cornillion, and I was asking how you handle time in climatologies, where the 1993 reference doesn't matter. Every January repeats; January 93 is the same as January 92, it's a climatology. Does that question make sense?

Ilg: Uh, it probably does, but I'm not sure I understand it yet.

??: If somebody wants, are you saying if somebody wants to search for a granule based on time and they ask for—

Cornillion: Time doesn't, every January looks exactly the same, so January 93 is no different than—

??: You don't have to assign a time to a granule if you don't want to. You only put it in the inventory that way so people can search by time—

 

Cornillion: Right, so I want to find the January granule, or the January 1st granule, but there's no year associated with it.

Khalsa: This would be a client function. You're talking about a particular client that allows access to the archives where these granules reside. But you don't have to encode a time, you don't have to assign a time in a granule's metadata if you don't want to. There can be static files, for example, in the ECS archive that don't have an observation time associated with them, so presumably this would fall under that class. If there are keywords like climatology attached to a collection, then somebody using the client could look for collections that have climatologies in them by searching on that keyword.

 

—Question 6: Flushing the Buffer During Malfunctions—

Carlisle: "Is there any way to flush the buffer during malfunctions, core dumps, so that the HDF file is usable? If a file is not closed, it is useless." Mike Folk is going to address this.

Folk: Yeah, unfortunately, there is not currently a way to flush the buffer without actually closing an HDF file. That's one of those things that we've been aware of and wanted to do something about for quite a while, but going back and adding a routine that does that would be fairly difficult. It's on our to-do list, but it's not real high on the to-do list because of the amount of work it would involve. It can be raised again, depending on how important it is and what resources we can apply to it. Part of the problem is that the code had undergone so much development by the time we realized we needed this feature that there was code corrosion and all that other stuff that has to be taken care of, just a lot of dependencies. I will say that in HDF5, it's already there, because we knew of the need for it.
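
[Editor's note: for reference, the flush capability Mike mentions appears in the HDF5 C interface as H5Fflush; a minimal sketch, assuming a current HDF5 installation rather than the 1998-era library.]

```c
/* Sketch: push an open HDF5 file's cached metadata and raw data to disk
   without closing it, so a later crash leaves the data written so far
   readable.  file_id is assumed to come from H5Fcreate/H5Fopen. */
#include "hdf5.h"

void checkpoint(hid_t file_id)
{
    H5Fflush(file_id, H5F_SCOPE_LOCAL);
}
```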

Carlisle: Any more discussion on that issue?

 

—Question 7: Impact of PI Processing—

Carlisle: OK, the next one. The person who submitted the next few questions didn't identify him or herself. The first one is about impact, but some of them we don't understand, so we think you may have to identify yourself before the end here, OK? OK—Stephanie Grager from JPL—she doesn't need the mike yet. The first question is: "Impact of possible PI processing on ESDIS, and the development/support of HDF-EOS?" And I'm going to address that one. To give a little background to folks who don't have it: the original vision or concept of EOSDIS was that science investigators, otherwise known as PIs, would develop processing algorithms for their data, those processing algorithms would be integrated into EOSDIS, and then all the science processing would be done at the DAACs. Now there's been a movement toward something called PI processing, where under certain options and circumstances, and there's a decision process for how that works, an investigator could in fact do their own processing at his or her facility and then just deliver the data to the ECS core system for archive and distribution. So this question is asking what the impact of that is on ESDIS and on HDF-EOS. The impact on ESDIS is that we just concentrate on different things: instead of concentrating as much on how to do our production, we now concentrate more on how to do our archive and distribution. And it turns out that the estimates of how many lines of code the EOS core system would use were originally way undersized, so there are still plenty of lines of code for all. In terms of HDF-EOS, we are still going to support that, maybe even more so, because the PIs will be developing HDF-EOS files, and they are still going to need things like the HDF-EOS library and the SDP toolkit. Any more discussion on that issue?

 

—Question 8: Internet Access of EOS Data Sets—

Carlisle: OK, the next one, and this may be one where we weren't sure we understood your question. It says "Internet access of EOS data sets, bandwidth—is this an issue?" And Tim Gubbels is going to address it.

Tim Gubbels: I'm just going to take a quick cut at this, and Rama may want to come in and clarify the policy statement. The available bandwidth to users is something that we really can't control, but it's going to exert a profound influence on how users interact with EOSDIS. At one end, you can operate a client but you can't even download some of the smallest data sets effectively; at the other end, fairly large downloads are very easy. So as far as I understand, it's going to be up to the DAACs to decide which data sets to encourage for download and set up on FTP sites, and which data sets to generally deliver by air express. I believe that's a DAAC-specific policy decision, and they'll probably generally err on the side of setting up large files on FTP sites, and good luck. So that's about all I can say.

Carlisle: Does anybody from the DAACs want to comment on that?

Cornillion: This is Peter Cornillion again. It seems that there are two issues involved in moving data over the network. One is you could either move a large volume of data over the network, which imposes a big impact on the network, or you can subset the data at the provider's site, which can impose a big impact on the provider's site CPU, but you might just then be moving small amounts of data around. And it seems that when the DAACs make the decision of what they're going to serve out, they should try to balance those two things, because my guess is that the bulk of uses from the scientific community are going to be for fairly small data sets or small chunks of data sets, and yet everything is geared to moving very large pieces of data sets around and letting the user subset them.

Gubbels: Hello? I also failed to mention that it is really the bandwidth issue that's the historical driver for the subsetting service at the DAACs, within ECS or as a DAAC extension. So it's quite likely that an environment of low ambient bandwidth will drive intensive user execution of the subsetting service.

Carlisle: OK, can we move on to our next question yet? OK.

 

—Question 9: Impact of HDF5 on HDF-EOS—

Carlisle: The next one is "Impact of HDF5 on HDF-EOS—preliminary impressions and thoughts?" And Mike and Doug are going to address this.

Ilg: OK, I guess we're going to attack this from a couple of different perspectives. First, the impact on any programs you may have written that use HDF-EOS should be almost nil. If we make the decision to move to HDF5, then HDF-EOS would be rewritten on top of HDF5, and any programs that you've written, at least those that stay strictly within HDF-EOS, would be unaffected, except for a recompile, or actually a relink.

Audience member: Doug, there are some features in HDF5 that Mike talked about that I think would be very nice for us to be able to use. So I guess I'm less concerned about rewriting code than I am about getting some of the features that will be available and useful into HDF-EOS.

Ilg: OK, I guess that would be more of an ESDIS and ECS decision as to whether or not new development is done on HDF-EOS to let some of the neater new features of HDF5 show through the HDF-EOS library. That would, obviously, require new code to be written rather than just refitting the old code.

Folk: Well, actually, if that was what the question was, then I don't think I have anything else to add. There are some other things I could have spoken about. We did spend quite a bit of time and effort last winter addressing this problem, and we have white papers and other documentation that I'd be glad to share with people on what to do about this problem of trying to live with both formats, or trying to go from one to the other, and so forth. At that time we met with the ESDIS folks here, and it was clear that we definitely wouldn't want to do anything before the AM-1 and Landsat launches, so we put it on hold and haven't really addressed the issue that much since. But it is something that we obviously have to revisit.

Ilg: Maybe there's one more thing I could add. Some of the features Mike discussed, one of the most important ones being the use of larger files, and also improved I/O speed, you would get sort of for free, just because you're now going through a better library underneath. So some of it will come through whether we want it to or not.

Carlisle: Any more discussion on that issue?

 

—Question 10: Variable Length Records—

Carlisle: OK, the next is: "Variable length records in HDF-EOS swath and in HDF5?" And again Doug and Mike are going to address that.

Folk: With HDF5, we already have variable length records. I'll talk primarily from the data structures point of view rather than the swath point of view. We already have the ragged array structure implemented, for example, although it's just a prototype. We're probably going to change the way we store it internally, but we already have an API for handling ragged arrays, and that probably won't change very much. Also, there is a variable-length data type in the specification, so, for example, whereas in the current HDF, HDF 4, you can only have fixed-length strings, you can have strings of variable size in HDF5. Another thing that helps, and it's in both versions, is that if you use compression, then you can have variable-length things embedded in larger fixed-length objects, where you use fill values, or whatever, to pad the part that isn't real information. In this case you don't take a performance hit. There is a disadvantage to that approach, however, which occurs when you ask about the size of something: you don't get the true size that you would if you were using a variable-length data type. To sum up, I think we're doing a fairly thorough job of addressing the need for variable-length objects in HDF5, and eventually we think the data structures will also be there to handle them efficiently. So I'll let Doug talk about the implications for swaths.

Ilg: I think for swaths we really need to know a little bit better what you mean by a variable record in a swath. You lost your mike?

Audience member 1: Well, the AIRS instrument has 2400 channels per latitude and longitude location, and the AIRS level 1b files are huge; the files are on the order of a gigabyte per granule, and we've chosen a quarter orbit as our granularity. Now, in the calibration process we have what are called DC restore files, and there may be a one-to-one mapping to the channels, and other times there may not be. We don't know ahead of time what that is, and we want to save these files along with the geolocation information, and it doesn't make sense to have a fixed length when we may just need one or none. You would double the size of the storage requirements when it certainly doesn't need to be that big. I've thought of ways around it, but they're all contortionist types of solutions.

Ilg: OK, do I understand that this is per granule? It might change from one granule to another?

Audience member 1: Yeah.

Ilg: OK, since a single swath can only exist in one granule, each granule is its own separate problem, and you can implement each one to have as many channels—

Audience member 1: No, this would be a single swath; the DC restores are in a single swath. Say for instance the AIRS channel set has 2400 channels, so for the data themselves it's not a problem, but with the DC restores I have latitude and longitude and I may not have 2400, I might have one, but that's all in that swath. So I might have 2400 here, one here—think of it in the vertical—yes.

Audience member 2: I think what they're saying is that on top of the individual channels, as an additional layer, you have some QA stuff.

Audience member 1: Well, calibration.

Audience member 2: Or calibration. OK. But it won't be for every single location in the swath; it's going to be very sparse there. So they're going to define the layer, but it's going to have big voids in it. That's not a problem, right? I mean, you—

Audience member 1: Well, it can double (??) the size of the file if we define a fixed length.

Audience member 2: But if you fill it with the fill values and compress it, then it'll go away. If there's no data, you use the HDF fill value, and then you compress the file, and you will regain a lot of space; you're not using any of that, right?

Ilg: If you do something like that, what you will lose is a little bit of performance, because it will have to compress and decompress. It's probably not a horrible issue, and the compression is available on a per-field basis for swaths, so you can compress only that one field. It's probably not really the elegant solution you're looking for, but that's all we have for you right now. Maybe offline we can talk about the workarounds that you've thought about, and maybe one of those workarounds would be appropriate for an actual modification.
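
[Editor's note: a minimal sketch of the workaround just described: define the sparse calibration field at full size, give it a fill value, and turn on per-field compression so the runs of fill value cost little on disk. The field name, dimension names, fill value, and compression level are all hypothetical.]

```c
/* Sketch: a mostly-empty swath field stored full size but deflate-
   compressed.  The dimensions are assumed to have been defined already
   with SWdefdim; error checking omitted. */
#include "HdfEosDef.h"

void define_sparse_field(int32 swathID)
{
    float32 fill = -9999.0f;
    intn    comp_parm[1] = { 6 };   /* gzip level */

    /* Compression settings apply to fields defined after this call. */
    SWdefcomp(swathID, HDFE_COMP_DEFLATE, comp_parm);

    SWdefdatafield(swathID, "DC_Restore", "Track,Channel",
                   DFNT_FLOAT32, HDFE_NOMERGE);
    SWsetfillvalue(swathID, "DC_Restore", &fill);
}
```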

Carlisle: That's the sort of thing we set up those consultation sessions for, that if you have a specific problem you want to work, for example we have HIRDLS signed up for a consultation session to try to work some of their instrument-specific problems. So I think there's still a couple of those available if you want to sign up and we can talk about that some more.

 

—Question 11: Filters/Conversion Packages—

Carlisle: OK, the next one was "Availability of filters/conversion packages to HDF-EOS for other common formats, for example GRIB, BUFR, UMO radiosondes; location of filters?" and Doug is going to address that.

Ilg: I was going to mention a couple of converters that I know of. One that's available through ECS, though I don't think there's an official distribution channel for it, is a GRIB to HDF-EOS converter; GRIB is a grid-like format, so I guess it converts into grids. For the BUFR converter, there's a plan to create one, but it's not been created yet. These were part of the data migration effort that was done earlier on in the ECS project, and that's been sort of put on hold now, so I don't know when a BUFR converter is going to be available.

Marinelli: The BUFR converter was actually in the ECS schedule to be delivered postlaunch, and as far as I know it's still in there. We're going to look into getting you the GRIB to HDF-EOS conversion; it's not packaged by itself, it's part of ECS, and we'll find some way to provide you the source code.

Ilg: And the other converter I was going to mention was [reconstructed - lost audio] the set of GIS converters. This is a set of six programs that can convert grid, point, and swath back and forth to the appropriate pieces of an ARC/INFO exchange format. [end reconstruction] And those are the only ones I know of.

Audience member: And the BUFR, or the GRIB, you're going to try to make source code available for that?

Marinelli: Yes, but the BUFR doesn't exist yet.

Audience member: And is the same true for the other stuff, Doug?

Ilg: I don't actually know. The code hasn't been delivered yet. I assume it's going to be available as source as well, probably more likely as source than as binaries, because we don't have 100 machines to compile it on.

 Audience member: Doug—Any time frame?

Ilg: I don't know the schedule for that right now. The programs are essentially done, they just need to be sort of polished up and put in a nice usable form.

Audience member: Can we ask the conveners of the meeting to provide that information when it's available?

Audience member: We can ask—

Carlisle: Will you write a note, Ted, to remind me? OK, any more discussion on that issue? OK, what we're going to do is—

Folk: Yeah, I have one thing. There was the ARC/INFO converter—Doug, you just need to correct me on this. I understood that that was done by ESRI, is that right? And that since ESRI uses a proprietary format, they would not make that available. Is that wrong?

Ilg: The format has never been publicized, but it's not one that they hide, necessarily, and all the work was done by STX.

Folk: OK, so the source code is available for that?

Ilg: Yes.

Carlisle: OK, more discussion on that issue? OK, what I'm going to do here is try to get through all the questions. Since we did move one of our demos from this afternoon up to this morning, I think we can run over a little if Ben says it's okay.

Ben Kobler: We can run over a little bit but the demos tend to take longer than scheduled so let's try to finish up if we can.

Carlisle: OK, all right.

 

—Question 12: User-Developed Software—

Carlisle: The next one is: "Is there a repository of user-developed software anywhere which utilizes the HDF-EOS API?" And I'm going to address that. An example of such a thing is that NCSA does in fact have a repository for HDF software, and Mike talked about that a little bit, I think. The caveats on that, of course, are that people put software into that repository and NCSA doesn't support it, but it does make it available and publish it. The reason I'm using that example is that we do have an HDF-EOS Web site, which I gave out the URL for to everybody, and we are certainly willing to do that sort of thing for HDF-EOS with the same sorts of caveats: we'll publish and make available any software that people would like to contribute to us, but we of course can't be responsible for maintaining or supporting it.

 

—Question 13: Large Objects—

Carlisle: OK, the last question on this list is: "Java tools, HDF, access to large objects— define a large object." And that's going to Mike Folk.

Folk: OK, for us a large object is 1 megabyte. Yeah! Now, I think this is a solvable problem. It's one that we're working on, and I just learned while we were discussing this that Lee Elson has had better success with WebWinds than we have with JHV, so we need to share ideas and learn from him about that. I'll also say, before I ask Lee to comment, that this is our highest priority item right now: figuring out ways to deal with large objects. When people first look at an object, we just subsample it, so there's no problem there. You can look at an entire image of something, but it's a heavily subsampled image. It's when you want to look at the real data and the real unsampled image, and you want to bring that all into memory at once, that there is a problem. If you want to pan back and forth, you need to either have it all in memory or do some sort of paging or something like that. So that's something that we're working on. Lee, do you want to add some comments here?
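
[Editor's note: JHV's subsampling is done in Java, but for illustration, a strided, subsampled read of a large dataset looks like this in the plain HDF SD C interface; the file, dataset name, and sizes are hypothetical.]

```c
/* Sketch: read every 8th sample in each dimension of a large 2-D SDS so
   a browse-sized array fits in memory.  Assumes the dataset is at least
   8000 x 10000; error checking omitted. */
#include "mfhdf.h"

void read_subsampled(char *path)
{
    int32 sd_id  = SDstart(path, DFACC_READ);
    int32 sds_id = SDselect(sd_id, SDnametoindex(sd_id, "Radiance"));

    int32 start[2]  = { 0, 0 };
    int32 stride[2] = { 8, 8 };          /* keep 1 sample in 8 */
    int32 edge[2]   = { 1000, 1250 };    /* subsampled output size */
    static float32 browse[1000][1250];

    SDreaddata(sds_id, start, stride, edge, (VOIDP)browse);

    SDendaccess(sds_id);
    SDend(sd_id);
}
```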

Elson: Yeah, just a couple of comments, and maybe some numbers. There are two problems that Java has right now. One is that the I/O is not always very efficient. That can be worked around; we've managed to work around it pretty easily, so I think the 1-megabyte limit that JHV has is something that could probably be worked around fairly easily. There's a second limit that I think all Java has right now, and that is I don't believe Java can use more than 96 megabytes of memory, so that's perhaps a limitation, although I don't know of too many analysis or visualization tools that regularly need to use more than 96 megabytes at a time. It may mean being creative about how you bring things in and out from disk. Now, Java is about to release a capability of doing tiling, so that should help; that should be built into it. So I think the bottom line is that in Java the size limitations can for the most part be worked around, and in the next year or so they should go away entirely; I don't think there will be any limitations.

 

—Question 14: Getting Data into an HDF Granule—

Carlisle: OK, moving right along here. Our next question was submitted by Tim Gubbels, and at first I told him that it was unfair for an expert to be posing a question, but he was doing this on behalf of some other people, so I let him get by with it. And his question is, "If you are a new data provider, how do you get your data into an HDF-EOS granule, given the bewildering array of utilities and tools available? What is the simplest solution for this?" And Siri Jodha has volunteered to address this.

Khalsa: Well, I ran upstairs quick-like and wrote up an answer, and I didn't get back down in time to run it by everybody else, so I may be corrected on the fly here. The recommended solution, the simplest solution... well, you have to define simple. It may involve fewer steps in doing the process, but maybe more steps in acquiring the pieces, so there are tradeoffs there. But let's talk first about the recommended solution, and then talk about some of the options. Though there hasn't been a lot of use of HCR yet, it is going to be officially released and promoted shortly, and its main purpose is to simplify this process for data providers. When you get the HCR package, it includes the HDF-EOS and HDF libraries, so for generating granules, it's complete. Then there's a metadata issue. So these are the steps that I outlined. First would be to write an HCR, and there's an editor there, so you can do that with a GUI, or you can just use your favorite editor and write this ODL file that specifies the structure of the HDF-EOS file that you're going to set up. Then you run the other utility to actually generate a skeletal granule from that. Then you would use calls, either HDF-EOS calls, or if you've got your arrays all set up, I think just an HDF call, to insert data into that file. So then you've got your data in your granule, and the only thing left to do is to put in the required metadata, and to do that we also have the new tool that was discussed this morning. So again, you'd open up the file that you've now populated and write metadata into it, but to do that you would need this metadata configuration file, which is also in ODL, just like the HCR is. Obviously, you have to get conversant in ODL in order to do these things efficiently. But the MCF can be generated from the online tool Metadata_Works; one of the options with that tool is to create an MCF. Using Metadata_Works has the added advantage of generating the collection-level metadata as well, which would be necessary, and this is where I went out on a limb: if this data is eventually going to reside in ECS, then you definitely have to go through this final step. But if you're going to have your own archive (I don't really know what the federation is going to look like, but if you're an investigator setting up your own archive and you want people to come to you to get data), well, if you want your data to be searchable by whatever network of catalogs or whatever search tools evolve out of this, then there's going to have to be some generation of this collection-level metadata. At a very high level, the directory level, you might use the Global Change Master Directory. In order for queries like "Give me surface temperature data over Africa" to be possible, you're going to have to populate more detailed collection- and granule-level metadata, and Metadata_Works is there to help you with that process. The only other thing I wanted to say was that you don't really need the Toolkit. As I said, in order for a granule to be acceptable as HDF-EOS it only has to have a minimum set of core attributes, four of them. The ODL file that would carry those four attributes is very simple; you can just use a template, put your data set name and your version number into it, and then use a generic HDF call to write that as an attribute with a specific name, core metadata, into the granule. So you don't really need the Toolkit_MDT library to populate the minimum metadata.
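
[Editor's note: a sketch of the last step described above, writing the core metadata ODL into the granule with a generic HDF call. The attribute name "coremetadata.0" is the conventional ECS name, and the ODL text here is a simplified illustration rather than the full required template with all four core attributes.]

```c
/* Sketch: write an ODL core-metadata block into an HDF granule as a
   file-level character attribute.  Product name and version are
   placeholders; consult the ECS metadata documentation for the real
   required structure. */
#include <string.h>
#include "mfhdf.h"

void write_coremetadata(char *path)
{
    const char *odl =
        "GROUP = INVENTORYMETADATA\n"
        "  OBJECT = SHORTNAME\n"
        "    VALUE = \"MYPRODUCT\"\n"
        "  END_OBJECT = SHORTNAME\n"
        "  OBJECT = VERSIONID\n"
        "    VALUE = 1\n"
        "  END_OBJECT = VERSIONID\n"
        "END_GROUP = INVENTORYMETADATA\n"
        "END\n";

    int32 sd_id = SDstart(path, DFACC_RDWR);
    SDsetattr(sd_id, "coremetadata.0", DFNT_CHAR8,
              (int32)strlen(odl), (VOIDP)odl);
    SDend(sd_id);
}
```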

Audience member: Does the HCR generate structural metadata?

Khalsa:  It does. Right.

Ilg: That's its basic purpose.

Khalsa: Right.

Audience member: Structural?

Ilg (?): It's just for structural metadata.

Audience member: Could you make copies of that, for back here?

Carlisle: Oh yeah. We'll get you copies of that, sorry.

Audience member: And I have a suggestion. It probably wouldn't apply to this workshop, but whatever the next one is, three, might have a session on just doing this. In other words, a case study of either a real or a hypothetical data product, probably a fairly simple one, where you go through the whole kit and kaboodle, in terms of defining the granules and actually writing the code and producing the products.

 Khalsa: Have you looked at any of the HDF-EOS documents? There are examples in the back that have metadata, you know.

Same audience member: Yeah, I'm just saying it might be worth going through that in one of these workshops.

Khalsa: Yeah, right.

Audience member: The HCR package, which you say includes the HDF-EOS and HDF libraries, does it include the infamous modified GCTP library as well?

Khalsa: Yes. That's part of the HDF-EOS library.

 Ilg: That would be the binary version.

Khalsa: Yes. These are all compiled.

Ilg: The GCTP library that comes with HDF-EOS is precompiled, and I think maybe we're going to, is there some thought of changing the way that works, so that the actual source goes out? I know it was considered to be a convenience to precompile it for everybody, but for those of you who are trying to port to other systems, it doesn't really help much.

Audience member: When you do a stripped down tool you probably do the source for it? (inaudible)

Audience member: Even the binaries that you provide, I often cannot link directly with Matlab, for example, because of the various compilation and link options necessary for linking into a Matlab executable. It's necessary to have the source. Earlier in the year, when I discovered, because of getting missing symbol errors, that this was necessary, I downloaded it from the USGS, and I asked on the EOS Tools mailing list about support for this package, and I got the answer that if USGS was unable to provide support for it then the EOS program would take that over. Now, that seems to be an iffy answer, at best. I feel like I've been on somewhat of a wild goose chase with regard to this particular library.

Panel member: So if we get you the source? (inaudible)

Panel member (Klein?): The problem with the USGS stuff is that we got it from them and they don't support it. And we're not about to make algorithm changes; we relied on them to get it right in the first place. But as far as installing it and getting it to work, we have to provide some support for that, because we adopted it: we had requirements for certain projections and we adopted that package, so we have to maintain it.

Panel member (Folk?): I'll say one more thing about HCR, a sort of little-known piece of information. Unbeknownst, I think, to Larry, we actually continued to work on HCR after we delivered it to them. We thought we were going to get some funding to continue it, so we started working on it. Actually, we didn't get the funding, so we've now stopped, but what we delivered to them was just a Solaris-only version, and since then we've ported it to several other platforms and fixed quite a few bugs, so probably the version that you guys have needs to be updated.

Audience member: Are you telling that to the workers?

Panel member: Yeah, in fact, we just sent a newsletter out with that information.

 

—Question 15: Metadata Tools—

Carlisle: OK, there was one more question that came up this morning that I wrote down, and I forget who asked it, but it was when Abe Taaheri was going over the Metadata Tools, and he said we've delivered that to STX but he wasn't clear on what the process was after that, so I asked Doug to address that.

Ilg: We currently have versions of the Metadata and Time Tools that are running on Sun, SGI, I think HP (got to get this right), and also PC. What we're missing so far from our promised set would be the DEC Alpha and IBM RS6000. The release should be sometime soon; we're still waiting on getting access to an IBM in particular, but we're working on getting that, and getting the proper compilers on it and everything, so that we can actually compile the whole thing. Anyone who's interested in getting a copy that is at least guaranteed to run on one of those machines can send e-mail to me. My e-mail address is on a lot of the handouts that went out, my phone number too, so I can get you a copy of those libraries.

Carlisle: OK, we've run over our time, so I'll turn the floor back over to Ben.

Kobler: I'd just like to thank the panel for answering all the questions and taking their lunch period to prepare them. Thanks again.

 
