My delicious links as a Wordle image:

I had a bit of an epiphany lately in my thinking about Code4Lib (what it is, what it should be, etc.). It’s all thanks to a post by Ed (who has the ability to shift my thinking every now and then).
I’ve always been of the mind that Code4Lib is an experiment… that it shouldn’t be centralized, organized, etc., and that view has colored my thoughts on other Code4Lib-ish issues (conferences, projects, etc.). I’ve ranted at times that all these Code4Lib-ish things should be as Code4Lib is. I don’t think so anymore. In fact, I think thinking this way was sort of putting the cart before the horse: it was actually trying to see Code4Lib in a centralized way, as a single thing.
My new approach to Code4Lib is that, if I don’t have a strong enough preference about something to actually get involved with it, I’m going to refrain from commenting unless, of course, opinions are solicited. Then I’ll just offer an opinion and go on my way. Take, for instance, the Code4Lib Planet. Sure, I have opinions but, no, it’s not something I really want to take up at this time. For that reason, I should just let the editors do what they want (which is what Ed said in his post basically).
In a way, this is approaching activities in the same way that I would an open source project. So what if the editors do something I don’t find useful? There is nothing stopping me from setting up my own Planet of Code4Lib authors for my own use if I feel so strongly about an issue. If I find an open source project that is close to what I want, but not quite there, it makes more sense to use it as I need to, modifying it as needed (rather than trying to sway the project’s owners away from their well-considered path).
This doesn’t mean I shouldn’t offer suggestions if solicited, or provide feedback about how something might be more useful to me as a random user, but there’s no need to feel some sort of distributed ownership over anything just because it’s Code4Lib-related. That just gets in the way of those who are already doing great work. If there is a project related to Code4Lib that I want to work on, I can work out the sticky issues with those who also want to put in the time.
As with most epiphanies, this isn’t really a big revelation. Most people were probably already thinking this way. I guess when you realize you’ve been deluded, though, it seems like a burst of light, voices on high, or something like that.
This quote is from Mark Diggory on the OAI-ORE mailing list:
[warning... stereotyping ahead]
This is an argument on a continuum of evolutionism vs. creationism…
Evolutionist say, establish the smallest possible set of laws to
enforce on a system and see what emergent behavior arises… while
the Creationist say, define the entire mechanism, top to bottom,
written in stone, and damn all who do not comply. Seems to me, folks
that come from the RDF World ascribe to the former and those from XML
Schema world tend to the later, both could stand to learn a little
from each other.
It’s an interesting quote (though I bristled a bit at being cast into the ‘creationist’ camp)…
I wonder if a third alternative might be the relativist. The relativist looks at the world from a particular local perspective and codifies it. Unlike the creationist, she does not seek to enforce her structure on everyone else, but sees it as the solution to a particular problem. Unlike the evolutionist, she does believe there is a need to validate her data and confirm that it fits with her worldview.
She’s a relativist because she looks beyond her particular assumptions and perspectives and realizes there are lots of other communities out there (like hers) that have a perceived need for valid data. Rather than say her schema is the one to rule them all she looks at ways to “trade” (share/crosswalk) information with other like communities. It doesn’t mean she wants to share with them all, but just that there are some with which she wants to have a relationship.
Dan: Does the linked data movement really depend upon RDF? It doesn’t seem like it has to. Maybe it could grow faster if it didn’t.
Bruce: Let’s turn the question around and ask: if not RDF, then what? You definitely need some model on which to base it, it seems to me, and things like GRDDL, microformats, etc. leave a lot of flexibility on the encoding end. The key for linked data is really the URI, of course, which becomes kind of like a key for a global database.
Me: Does it need a single data model? It does if machines are doing all the work automagically, but if there are people involved does it? The key to me seems to be the “linked” part. I really like Dan’s coining of “Linked Description” — this doesn’t seem to be splitting hairs to me, like Bruce suggests, but more of a recognition of a component in the process which isn’t considered necessary in the Linked Data perspective.
Regarding my last post… on the other hand, if I look at the conference as an outgrowth of Code4Lib (like the journal is an outgrowth), what does it matter? As long as the question isn’t whether we should formalize the Code4Lib group, but only whether we should formalize the conference, do I really care? Probably not.
The Code4Lib conference is fast approaching and, as people start to spend more time thinking about it, there is the question of how things should be done differently over the coming year. It’s a good question and one that Code4Lib does well to ask itself. Any “organization” that doesn’t question whether it’s doing things as well as it could be is already dead.
I think the question was originally posed on the mailing list as to how much structure Code4Lib should put into place to ensure that these conferences continue to happen (and perhaps grow). The issue has also been framed as a question of how much formality do we want to put into place (and where is the proper intersection between structure and formality)? My gut reaction was that I don’t want much formality and that any structure beyond a list of things that need to get done (and a list of volunteers willing to do them) is too much.
I think this is in part because I see Code4Lib as an experiment. There are plenty of well established (national and regional) organizations that put on conferences year after year (they are very dependable). Code4Lib is, for me, the exact opposite. It is a loose group of people who have gathered around a particular banner (library technology) because it interests us — it is more than just a job or a way to pay the bills (though many people in the Code4Lib cloud work in day jobs that do just that). In the end, though, I don’t need Code4Lib to fulfill any professional obligations and, while I would be sad to see it disappear, I wouldn’t lose any sleep over it either — I don’t see it as a movement.
It is this loose association that keeps me interested and coming back. There are times that I do not feel myself in sync with the group and there are times that I do. It is a relationship that I can take or leave, and I do (over and over again). I do not see Code4Lib as some sort of centralized organization that has a majority opinion. There are, at times, majorities of opinion, but they are no more representative of Code4Lib as a whole than the minority opinions because there is no consistency in them (the majorities) over time. I think this is, in part, because Code4Lib hasn’t, up to this point, been formalized into a “real thing” that needs to take a stand on issues (e.g. has a need to promote a consistent opinion, works as a force in the library community, etc.)
The issue to me, I think, boils down to a question of decentralization/centralization. I’d like to see Code4Lib stay as decentralized as possible. I think a couple years of conferences have taught us that we definitely need some procedures (best practices). We also need a way to recognize people who volunteer to step up and take responsibility for something that needs to get done (for the fun stuff to happen). I think we’ve also learned after a couple of bouts that we would benefit from some code that makes making decisions, as a group, easier.
These to me, though, are more indicators that we need better documentation (and more communication) than they are indicators that we need a centralization of the organization. Both approaches would be valid ways to go… don’t get me wrong. We could document more (put our organizational knowledge into external form) or, just as valid, we could rely on people who would hold offices over a period of years. These are just two different paths. I think whether a person chooses one over the other probably has to do with where s/he sees Code4Lib going in the future (that is what paths are good for after all — or, maybe I should say, “Whether one sees the path as the point or whether there is an attainable destination in mind”).
The David Clark quote seems relevant: “We reject kings, presidents and voting. We believe in rough consensus and running code.” It is dissected in a paper up at W3C:
Open Participation ("We")
citizen engineer: citizen is a contributor to her space (lists, Web, MUD, FAQ)
Consensus. ("... believe in rough consensus ...")
is it good enough, does it merit moving on, are there show stoppers?
No Kings ("... reject kings, presidents and voting.")
consensus mediated by Elders, citizen engineers who built the space and institutions others inhabit.
Running Code / Implementation ("... believe ... in running code.")
all policy is tested by both its support and formulation through implementation
I do think that having the conference mediated by kings (okay, that’s a loaded word isn’t it?) would ensure the successful growth of the Code4Lib conferences more reliably than just having better documentation would. Maybe. But, on the other hand, I’m okay if we have a few failures (e.g., have a conference or two that is a real stinker). We’ll learn from these (with good communication) just like we learn from making programming mistakes or faulty assumptions in our code. If we don’t learn and Code4Lib disappears that’s okay too. It was what it was and lived as long as it was useful.
I think, though, that creating a centralized structure that manages things from the top down will be a mistake for the group. I like the idea of having the local folks who are going to plan and host the conference take more of the responsibility. This doesn’t mean they have to take it all (especially if they don’t want to). I think this, though, will give each conference its own flavor (and that’s a good thing). If we rely on a centralized group to make consistent decisions year after year, things will go along smoothly enough but we will lose, in my opinion, some of the uniqueness of the conference (if it can be said to have characteristics after just a couple of years).
Maybe people would prefer the consistency. Maybe people want Code4Lib to become a force in the library community. I don’t. I’m happy (as happy can be) with the other professional organizations. Yes, they have their share of problems, but I don’t think creating a new professional organization is going to solve any of these. Anyway, enough crazy rambling…
How do we approach the problem of metadata (and which formats we should use)? In what way is metadata like cataloging and in what way is it different? What is metadata anyway?
I see metadata as cataloging. Libraries have a strong historical tradition of organizing materials. I think it is fair to say that the way we’ve done this has changed over time with the emergence of new technologies. There are at least two questions that should, in my opinion, guide us in participating in the process of evolving our cataloging: “How do we continue to do what is appropriate and right (contextually speaking)” and “How do we adjust what we do to fit our changing environment (e.g., to take advantage of new perspectives)?”
I see metadata as cataloging; it is what cataloging is becoming. It is the same thing, and yet we give it a different label because we need to adjust our perception of it… just enough so that we can question it. We need to distance ourselves from it… just enough to look at it differently, make judgments about it, and then reintegrate our new perspective into our work. It is at this point, I believe, that “metadata” becomes “cataloging” again… the need for a different label is gone – until the process begins anew.
The same is true of the “digital library” — what is a digital library? Isn’t a digital library the same thing as a traditional library? In a way, I think, it is a trick we play on ourselves to allow us to question some of our basic assumptions, make some discoveries (and mistakes), and then integrate what we’ve learned. At that point, the “digital library” disappears and there is just “the library” (we may even look back and think, “Wasn’t it there all along? It sure seems like we were doing library work.”)
I believe a key question when looking at metadata formats from the digital materials community is, “How do libraries differ from other institutions of cultural knowledge?” For instance, are there similarities between libraries and other institutions of cultural knowledge on which we could be building? How similar are our missions or should the differences between our materials be the determining factor? What role would a unified authorities system play across a diverse material environment?
Ultimately, I think we should make metadata decisions based on the context of the digital materials community (and the materials that we are digitizing). It isn’t that digital materials are somehow fundamentally different from their physical counterparts or that a different level of description is required. The decision is made to foster an environment where people working towards the same goal can share ideas, tools, and realizations.
The purpose of metadata is to ask the questions; when we understand the answers, I believe, we’ll realize we’ve been cataloging all along (but we have to ask the questions to truly understand the realization). We could liken the question of whether metadata is cataloging to the Zen parable:
“Thirty years ago, before I had studied Zen, I saw mountains as mountains and rivers as rivers. And then later, when I had more intimate knowledge, I came to see mountains not as mountains and rivers not as rivers. But now that I have attained the substance, I again see mountains just as mountains, and rivers just as rivers.”
All this is sort of the philosophical underpinnings of a viewpoint. It is a different thing to implement this perspective as a way forward, I know. For instance, libraries have large numbers of catalogers who are trained to catalog. It may or may not be interesting to them to look at what they do from a broader perspective (though it should be, because what they do is interesting from the micro and macro views).
Pragmatically, I’d suggest librarians start with MODS and MADS. They are the most similar to what librarians already do, so they won’t seem like such a stretch. Get comfortable with them. Look at what moving the information around a little does to the overall organization (in comparison to what it would look like in a MARC record). Look at what is added to and left out from the MARC.
It is important, though, not to stop there. After all, the Library of Congress is not advocating that people stop using MARC in favor of MODS. MODS and MADS are just one set of (possible) answers.
To truly investigate the question of evolving cataloging we need to look from a variety of perspectives. If MODS and MADS are as far as one wants to question, I think it makes more sense to just stick with MARC. MARC is more granular than MODS/MADS and, though it has its fair share of organizational problems (even related to its level of granularity), it is certainly well understood by a select group of trained professionals (you know who you are).
To work with metadata (in the sense I’ve been discussing here), one should look at other metadata formats… EAD, TEI, Dublin Core, VRA Core, and even XOBIS! Look at similarities and differences. Discover why decisions were made when the authors created the schema (for instance, were they made based on the material in hand, because of traditions within the community, or to satisfy perceived user needs?)
The goal of working with metadata shouldn’t be to make a definitive and final decision about which format a library should use, but to be open to evaluating and re-evaluating the strengths and weaknesses, uses and user communities, and even the cataloging communities from which the metadata comes. It is important to be pragmatic… not to overwhelm yourself with more than you can digest at a time, but when you have new projects (or have time to investigate) take a look at other metadata formats. You might like what you see.
A while back, Andrew Nagy posted an XSLT for turning MARCXML into Solr’s XML indexing format. I thought it would be fun to take his XSLT and do the same thing in XQuery. I think it is pretty much a one-to-one conversion.
For the upcoming Code4Lib preconference, I thought about forming an XQuery group. I ended up joining the Java group, though, because there aren’t any native HTTP libs in XQuery (so I’d have to do that as an extension in Java anyway). I still think doing an XQuery group would be fun though.
For instance, one nice feature of XQuery is that it allows you to be as strongly or loosely typed as you’d like. Take off all the “as …” declarations from the XQuery and it still works just fine (it just won’t be so picky about what you pass into, or return from, its functions).
Recently, I’ve found myself on both sides of this fence. When working with a little bit of throw-away Java code, I’ve found myself wishing for a little of Ruby’s loose typing. On the other hand, when experimenting with Ruby, I sometimes mutter to myself: “Why can’t this just be strongly typed so I know what to expect and do?”
XQuery really gives you the best of both worlds. This isn’t to say XQuery can do everything those other languages can (it can’t… and far from it). But, if you are working with XML (and want to focus on the data rather than the data’s source) I can’t think of a nicer language to use. It will be interesting to watch XQuery grow as a programming language.
So anyway… since my marc2solr.xq is written as a module you’ll need to call it from something else. This little XQuery (also here) works fine from Saxon (pass in the location of a MARCXML file on the file system as $input):
xquery version "1.0";
import module namespace marc2solr = "http://lisforge.net/ns/marc2solr" at "marc2solr.xq";
declare variable $input external;
marc2solr:add-records(doc($input))
I’ve been following (off and on) the discussions on #code4lib about mapping MARC to indices. I know each ILS has a different way of making this happen, but I wonder whether there has been any effort to pool together the decisions people have made (for instance, what MARC fields and subfields should be used for a title, or author, search?) It would be interesting to see how much uniformity (or not) is out there.
I’ve learned from #code4lib that Erik Hatcher is working on a Ruby library that will index MARC (so he has a start on a MARC mapping in a subversion repository). Are there other sources for seeing how people have mapped their MARC (or even how they’ve cleaned up their data; I know CDL has a date normalizing library)? Is a site where this sort of information could be shared something that other people would find useful (and do our ILS contracts allow us to share it in some generic form)?
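To make the idea of a shareable mapping a little more concrete, here is a minimal Ruby sketch of what one might look like as plain data. Everything named here is made up for illustration: INDEX_MAP, index_values, and the array-of-pairs record shape do not come from any ILS or from Erik’s library. The only MARC facts assumed are that 245 $a and $b carry the title and that 100 and 700 $a carry personal names. Which fields really belong in a title or author search is exactly the open question.

```ruby
# Hypothetical mapping: index field => list of [MARC tag, subfield codes].
# The choices are illustrative only, not a recommendation.
INDEX_MAP = {
  title:  [['245', %w[a b]]],
  author: [['100', %w[a]], ['700', %w[a]]]
}.freeze

# A record is modeled as an array of [tag, {code => value}] pairs so the
# sketch needs no MARC library. (Repeated subfields are not modeled.)
def index_values(record, index_map = INDEX_MAP)
  index_map.transform_values do |sources|
    values = sources.flat_map do |tag, codes|
      record.select { |field_tag, _subfields| field_tag == tag }
            .map { |_tag, subfields| subfields.values_at(*codes).compact.join(' ') }
    end
    values.reject(&:empty?)
  end
end

record = [
  ['100', { 'a' => 'Melville, Herman,' }],
  ['245', { 'a' => 'Moby Dick, or,', 'b' => 'The whale /' }]
]

index_values(record).each do |index_field, values|
  puts "#{index_field}: #{values.join(' | ')}"
end
```

Keeping the mapping as data rather than code is what would make it easy to publish and compare across institutions.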
This weekend I reimplemented the XMLReader and XMLWriter classes in ruby-marc using Libxml-Ruby, a Ruby layer over the Libxml2 C library.
Currently, ruby-marc uses REXML, a pure Ruby XML library. Since REXML is built into Ruby, it is convenient. I was curious, though, how much of a performance boost there would be from using Libxml2. Here are the results of my very informal test (using some HCL MARC data):
                  User       System    Total      Real
XMLReader [old]:  24.300000  0.030000  24.330000  25.607547
XMLReader [new]:   3.180000  0.010000   3.190000   3.231896
XMLWriter [old]:  38.960000  0.060000  39.020000  41.017238
XMLWriter [new]:  11.950000  0.050000  12.000000  12.607114
Both XMLWriter times include the new XMLReader reading records in from a source file. As a record is read in, it is written out to a new file. This is just intended to get an inkling of what the difference between the two versions might be (not to be a formal benchmark). Lower numbers are better.
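For anyone wanting to run a similarly informal comparison, here is a rough sketch of how the four columns above (user, system, total, real) can be collected with Ruby’s standard Benchmark module. It is not the script I used. The timing_row helper and the two stand-in workloads are placeholders; in a real run each block would read the same MARCXML file with one of the two reader implementations.

```ruby
require 'benchmark'

# Times the given block once and formats the user, system, total and real
# figures as one row. The helper name is hypothetical.
def timing_row(label)
  tms = Benchmark.measure { yield }
  format('%-18s %10.6f %10.6f %10.6f %10.6f',
         label, tms.utime, tms.stime, tms.total, tms.real)
end

# Stand-in workloads (assumptions): replace each with code that reads the
# same source file using the reader being measured.
old_workload = -> { (1..50_000).map(&:to_s).join }
new_workload = -> { (1..50_000).to_a.join }

puts format('%-18s %10s %10s %10s %10s', '', 'User', 'System', 'Total', 'Real')
puts timing_row('XMLReader [old]:') { old_workload.call }
puts timing_row('XMLReader [new]:') { new_workload.call }
```

A single run like this is noisy, so it is only good for spotting large differences.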
So, in reimplementing, I completely rewrote the reader. It just reads from a file and returns MARC::Record objects. What is being used to read the XML is completely swappable with anything else.
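The “swappable” part can be sketched roughly as below. This is not the ruby-marc code: SketchRecord stands in for MARC::Record, SketchXMLReader and REXML_PARSER are hypothetical names, and the default parser uses REXML only so the example runs without extra gems. It also loads the whole document at once, which a real reader would probably avoid. The point is just that the reader depends on “something that turns an IO into records” rather than on a particular XML library.

```ruby
require 'rexml/document'
require 'stringio'

# Hypothetical stand-in for MARC::Record: a list of [tag, code, value] triples.
SketchRecord = Struct.new(:subfields)

# Child elements of a node with the given local name (namespace ignored).
def child_elements(node, name)
  node.children.grep(REXML::Element).select { |child| child.name == name }
end

# Default parsing strategy. Any callable that takes an IO and returns an
# array of records could be passed to the reader instead.
REXML_PARSER = lambda do |io|
  root = REXML::Document.new(io).root
  child_elements(root, 'record').map do |record|
    triples = child_elements(record, 'datafield').flat_map do |datafield|
      child_elements(datafield, 'subfield').map do |subfield|
        [datafield.attributes['tag'], subfield.attributes['code'], subfield.text]
      end
    end
    SketchRecord.new(triples)
  end
end

class SketchXMLReader
  include Enumerable

  def initialize(io, parser = REXML_PARSER)
    @io = io
    @parser = parser
  end

  # Yields each record produced by whichever parser was supplied.
  def each(&block)
    @parser.call(@io).each(&block)
  end
end

xml = <<~XML
  <collection xmlns="http://www.loc.gov/MARC21/slim">
    <record>
      <datafield tag="245" ind1="1" ind2="0">
        <subfield code="a">Moby Dick</subfield>
      </datafield>
    </record>
  </collection>
XML

SketchXMLReader.new(StringIO.new(xml)).each { |record| p record.subfields }
```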
With the writer, I changed the encode method so that it now takes an option specifying which library should be used (REXML is still the default). Since the method is public, I figured someone is probably using the REXML Documents it returns, and their code would break if I returned a Libxml Document instead. The write method, on the other hand, now uses Libxml by default.
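Here is a hedged sketch of that backward-compatible option pattern. The method signature, the :library option name, and the simplified input (an array of [tag, code, value] triples instead of a MARC::Record) are all assumptions made for illustration; the actual encode method may look different. Only the REXML branch is implemented so the example runs without the Libxml-Ruby gem.

```ruby
require 'rexml/document'

MARC_NS = 'http://www.loc.gov/MARC21/slim'

# Builds a REXML::Document from [tag, code, value] triples (a simplified,
# hypothetical stand-in for a full MARC record).
def encode_with_rexml(fields)
  doc = REXML::Document.new
  record = doc.add_element('record', 'xmlns' => MARC_NS)
  fields.each do |tag, code, value|
    datafield = record.add_element('datafield', 'tag' => tag, 'ind1' => ' ', 'ind2' => ' ')
    datafield.add_element('subfield', 'code' => code).text = value
  end
  doc
end

# Callers that pass no option keep getting a REXML::Document, so existing
# code does not break. A Libxml-backed branch would be added as another
# `when` clause; it is left out here on purpose.
def encode(fields, options = {})
  library = options.fetch(:library, :rexml)
  case library
  when :rexml then encode_with_rexml(fields)
  else raise ArgumentError, "unsupported XML library: #{library.inspect}"
  end
end

doc = encode([['245', 'a', 'Moby Dick']])
puts doc.class
puts doc.root.name
```

Defaulting to the old behavior and making the new library opt-in is the design choice that keeps the public method safe to change.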
I haven’t checked in any of these changes yet (since I haven’t passed them by Ed and don’t know whether they should be incorporated), but I have validated that the existing tests still pass just fine.
The speed improvements are pretty nice. If an extra dependency can be tolerated it would be nice to have the performance boost. The only other caveat is I used the 0.4.0pre01 version of Libxml-Ruby. It might be desirable to wait until the final 0.4.0 release.
Anyway, I’ll get Ed’s opinion on all this sometime this next week. Right now, it is just a fun experiment.