RDF modelling at Joost: no bnodes

As I mentioned in a previous entry, Joost™ uses quite a bit of RDF. I'm sorry, but I'm not going to share our full data model with you (though we might do that in the future). All I want to try and do is highlight some basic choices that we (mostly Alberto) have made on how to model things using RDF.

Choice number one:

No bnodes

A blank node, or "bnode" for short, is a node in an RDF graph (a subject or an object) that doesn't have a 'real' URI of its own.

Where do you use bnodes?

You encounter bnodes when modelling things in a 'normal' object-oriented fashion, and especially a lot in 'normal' modern XML. For example, the XML document

Listing 1
<group-list xmlns="http://example.org/lsd/2007/how-joost-models-rdf.html#">
  <group id="foo" name="Mentioned in article">
    <people>
      <person id="leo" name="Leo Simons"/>
      <person id="alberto" name="Alberto Reggiori"/>
    </people>
  </group>
</group-list>

might be turned into RDF as

Listing 2
@prefix : <http://example.org/lsd/2007/how-joost-models-rdf.html#> .
:leo      a                  :Person ;
          :name              "Leo Simons" .
:alberto  a                  :Person ;
          :name              "Alberto Reggiori" .
:foo      a                  :Group ;
          :name              "Mentioned in article" ;
          :containsPeople    ( :leo :alberto ) .

which is a special Notation3 (or Turtle) shorthand for

Listing 3
@prefix :    <http://example.org/lsd/2007/how-joost-models-rdf.html#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:leo      a                  :Person ;
          :name              "Leo Simons" .
:alberto  a                  :Person ;
          :name              "Alberto Reggiori" .
:foo      a                  :Group ;
          :name              "Mentioned in article" ;
          :containsPeople    _:1 .
_:1       rdf:first          :leo ;
          rdf:rest           _:2 .
_:2       rdf:first          :alberto ;
          rdf:rest           rdf:nil .

_:1 and _:2 are bnodes. That doesn't look like much of a problem, does it? (Aside from RDF collections being cumbersome.)

How does it look without bnodes?

Well, consider this alternative:

Listing 4
@prefix : <http://example.org/lsd/2007/how-joost-models-rdf.html#> .
:leo      a                  :Person ;
          :name              "Leo Simons" .
:alberto  a                  :Person ;
          :name              "Alberto Reggiori" .
:foo      a                  :Group ;
          :name              "Mentioned in article" ;
          :containsPerson    :leo ;
          :containsPerson    :alberto .

It consists of fewer triples (8 instead of the 11 in Listing 3), which means less storage space and, given the nature of RDF databases today, better performance. As a data model grows in complexity, the percentage of bnodes tends to grow a bit as well, so the effect becomes more pronounced with lots of data.
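To make the triple count concrete, here is a minimal sketch in plain Java. It does not use Jena: a triple is simply an array of three strings, and the class name `TripleCount` and both method names are made up for this illustration. It encodes the group membership from the listings once as an RDF collection (rdf:first/rdf:rest cells hanging off bnodes) and once as a repeated property, and counts the triples each encoding needs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, not Jena: a triple is just {subject, predicate, object}.
class TripleCount {

    // Membership as an RDF collection: one triple pointing at the list head,
    // then one rdf:first and one rdf:rest triple per member.
    static List<String[]> asCollection(String group, List<String> members) {
        List<String[]> triples = new ArrayList<>();
        if (members.isEmpty()) {
            triples.add(new String[] {group, ":containsPeople", "rdf:nil"});
            return triples;
        }
        triples.add(new String[] {group, ":containsPeople", "_:1"});
        for (int i = 0; i < members.size(); i++) {
            String cell = "_:" + (i + 1);                  // bnode for this list cell
            boolean last = (i == members.size() - 1);
            String rest = last ? "rdf:nil" : "_:" + (i + 2);
            triples.add(new String[] {cell, "rdf:first", members.get(i)});
            triples.add(new String[] {cell, "rdf:rest", rest});
        }
        return triples;
    }

    // The same membership as a repeated one-to-many property: one triple per member.
    static List<String[]> asRepeatedProperty(String group, List<String> members) {
        List<String[]> triples = new ArrayList<>();
        for (String member : members) {
            triples.add(new String[] {group, ":containsPerson", member});
        }
        return triples;
    }

    public static void main(String[] args) {
        List<String> members = List.of(":leo", ":alberto");
        for (String[] t : asCollection(":foo", members)) {
            System.out.println(String.join(" ", t) + " .");
        }
        System.out.println("collection triples: " + asCollection(":foo", members).size());
        System.out.println("repeated property triples: " + asRepeatedProperty(":foo", members).size());
    }
}
```

For the two members in the example the collection needs five membership triples against two for the repeated property; for n members that is 2n + 1 against n.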

More importantly, the software you have to write to deal with bnodes becomes more involved. Let's investigate.

The effect of bnode use on source code

Here's some imaginary Java code (using Jena) that prints certain data it finds in the model:

Listing 5
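// Package names as of Apache Jena 3 and later; older releases used com.hp.hpl.jena.*
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.jena.datatypes.DatatypeFormatException;
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;
...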
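private String[] getName(Model m, Resource r) {
  List<String> names = new ArrayList<String>();
  // listObjectsOfProperty() returns an iterator; copy it into a list so we can loop over it
  for(RDFNode nameNode : m.listObjectsOfProperty(r, Example.name).toList()) {
    if(!nameNode.isLiteral()) {
      continue; // open world: a name might point to a resource, skip those
    }
    Literal nameLiteral = nameNode.asLiteral();
    try {
      names.add(nameLiteral.getString());
    } catch(DatatypeFormatException e) {
      // ignore names whose literal value cannot be read
    }
  }
  return names.toArray(new String[names.size()]);
}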

private boolean hasType(Model m, Resource toCheck, Resource expectedType) {
  // rdf:type is exposed by Jena as RDF.type
  return m.contains(toCheck, RDF.type, expectedType);
}

private void printHeader(Model m, Resource groupResource) {
  String[] groupName = getName(m, groupResource);
  if(groupName.length > 0) {
    // a group can have any number of names, print them all
    for(String name : groupName) {
      System.out.println("Group name: " + name);
    }
  } else {
    System.out.println("Group name: <unknown>");
  }
  System.out.println("----");
}

private void printName(Model m, Resource personResource) {
  String[] personName = getName(m, personResource);
  if(personName.length > 0) {
    // a person can have any number of names, print them all
    for(String name : personName) {
      System.out.println("  " + name);
    }
  } else {
    System.out.println("  <unnamed person>");
  }
}

public void printInfoAboutGroup(Model m, URI groupId) {
  ValidationUtil.checkNotNull(m, "m");
  ValidationUtil.checkNotNull(groupId, "groupId");
  
  Resource groupResource = m.getResource(groupId.toString());
  if(groupResource == null) {
    System.err.println("Warning: no such group: " + groupId.toString());
    return;
  }
  
  if(!hasType(m, groupResource, Example.Group)) {
    System.err.println("Warning: not typed as a group: " + groupId.toString());
  }
  
  printHeader(m, groupResource);
  
  //
  // NOTE: for loop in a for loop in a for loop
  //
  for(RDFNode groupHead : m.listObjectsOfProperty(groupResource, Example.containsPeople).toList()) {
    // an RDF collection (rdf:first/rdf:rest) maps to Jena's RDFList
    if(!groupHead.canAs(RDFList.class)) {
      System.err.println("Warning: Group "+groupId.toString()+" containsPeople points to a non-list");
      continue;
    }
    RDFList list = groupHead.as(RDFList.class);
    
    for(RDFNode peopleNode : list.asJavaList()) {
      if(!peopleNode.isResource()) {
        System.err.println("Warning: Group "+groupId.toString()+" rdf:first points to a literal: " +
          peopleNode.asLiteral().getLexicalForm());
        continue;
      }
      Resource peopleResource = peopleNode.asResource();
      if(!hasType(m, peopleResource, Example.Person)) {
        // bnodes have no URI, so fall back to their internal label
        String identifier = (peopleResource.isAnon())?
            peopleResource.getId().getLabelString() :
            peopleResource.getURI();
        System.err.println("Warning: not typed as a person: " + identifier);
      }
      printName(m, peopleResource);
    }
  }
}
...

Here's the printInfoAboutGroup() method again, now for the RDF model structure from Listing 4.

Listing 6
...
public void printInfoAboutGroup(Model m, URI groupId) {
  ValidationUtil.checkNotNull(m, "m");
  ValidationUtil.checkNotNull(groupId, "groupId");
  
  Resource groupResource = m.getResource(groupId.toString());
  if(groupResource == null) {
    System.err.println("Warning: no such group: " + groupId.toString());
    return;
  }
  
  if(!hasType(m, groupResource, Example.Group)) {
    System.err.println("Warning: not typed as a group: " + groupId.toString());
  }
  
  printHeader(m, groupResource);
  
  //
  // NOTE: for loop in a for loop
  //
  for(RDFNode personNode : m.listObjectsOfProperty(groupResource, Example.containsPerson).toList()) {
    if(!personNode.isResource()) {
      System.err.println("Warning: Group "+groupId.toString()+" containsPerson points to a literal: " +
        personNode.asLiteral().getLexicalForm());
      continue;
    }
    Resource personResource = personNode.asResource();
    if(!hasType(m, personResource, Example.Person)) {
      // bnodes have no URI, so fall back to their internal label
      String identifier = (personResource.isAnon())?
          personResource.getId().getLabelString() :
          personResource.getURI();
      System.err.println("Warning: not typed as a person: " + identifier);
    }
    printName(m, personResource);
  }
}
...

I suspect that, if you haven't seen RDF-inspecting source code before, all of the above looks a little scary. There's a load of looping and checking that you don't have to take into account when using simple JavaBeans: any property can have zero, one or many values, and any value can turn out to be a literal, a URI or a bnode. This is the price to pay for the open world assumption, though of course a lot of it can be abstracted out into utility code a lot better than I've done above.

However, in the midst of all that Java fluff, the difference should still be clear: Listing 6 has one nested for loop fewer than Listing 5. No matter how much you clean up this code, that fundamental difference remains, and, because of the open world assumption, it is rather more important than the equivalent difference would be in the object-oriented or XML world (see the comparison in the conclusion).
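The extra level of traversal can be isolated from the Jena details with a small sketch. Everything below is hypothetical and deliberately simplified: `Traversal`, `Triple`, `objects()` and the two `membersVia...` methods are names invented for this illustration, and the "store" is just a list of string triples rather than a Jena Model. The point is only to show that the collection model needs a walk over list cells inside the loop over property values, while the flat model needs a single lookup.

```java
import java.util.ArrayList;
import java.util.List;

// Toy triple store, not Jena; all names here are invented for illustration.
class Traversal {
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    // Open world: a subject may have zero, one or many values for a property.
    static List<String> objects(List<Triple> store, String s, String p) {
        List<String> result = new ArrayList<>();
        for (Triple t : store) {
            if (t.s.equals(s) && t.p.equals(p)) {
                result.add(t.o);
            }
        }
        return result;
    }

    // Collection model: loop over every list head, then walk its chain of cells.
    static List<String> membersViaCollection(List<Triple> store, String group) {
        List<String> members = new ArrayList<>();
        for (String head : objects(store, group, ":containsPeople")) {
            String cell = head;
            int guard = store.size(); // bounds the walk if the list is cyclic
            while (!cell.equals("rdf:nil") && guard-- > 0) {
                members.addAll(objects(store, cell, "rdf:first"));
                List<String> rest = objects(store, cell, "rdf:rest");
                cell = rest.isEmpty() ? "rdf:nil" : rest.get(0);
            }
        }
        return members;
    }

    // Flat model: the property values are the members.
    static List<String> membersViaProperty(List<Triple> store, String group) {
        return objects(store, group, ":containsPerson");
    }

    static List<Triple> collectionStore() {
        List<Triple> store = new ArrayList<>();
        store.add(new Triple(":foo", ":containsPeople", "_:1"));
        store.add(new Triple("_:1", "rdf:first", ":leo"));
        store.add(new Triple("_:1", "rdf:rest", "_:2"));
        store.add(new Triple("_:2", "rdf:first", ":alberto"));
        store.add(new Triple("_:2", "rdf:rest", "rdf:nil"));
        return store;
    }

    static List<Triple> flatStore() {
        List<Triple> store = new ArrayList<>();
        store.add(new Triple(":foo", ":containsPerson", ":leo"));
        store.add(new Triple(":foo", ":containsPerson", ":alberto"));
        return store;
    }

    public static void main(String[] args) {
        System.out.println(membersViaCollection(collectionStore(), ":foo"));
        System.out.println(membersViaProperty(flatStore(), ":foo"));
    }
}
```

Both calls yield the same two members. The collection version also has to decide what to do with a missing rdf:rest, several rdf:first values on one cell, or a cycle, none of which can occur in the flat model.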

Conclusion

Because of the open-world assumption, making use of bnodes is very expensive when doing real-world software development. Therefore, bnodes should be avoided. Compare:

  • Object oriented world: foo.bar.getXyz() vs. foo.getBarXyz().
  • XML world: foo.getElementsByTagName("bar").getAttribute("xyz") vs. foo.getAttribute("barXyz")
  • RDF world: for(foo) { for(bar) { for(xyz) { ... }}} vs. for(foo) { for(barXyz) { ... }}.

You can forget all of the above; just remember these rules:

  • Don't use RDF collections. Use one-to-many properties that result in "collections" instead.
  • If you need ordering, define the sorting algorithm instead of putting the ordering in your data.
  • If you have (sort-of) one-to-one relationships in your model, and one or both sides of the relationship is identified by a bnode, merge the concepts into one and distinguish using properties.
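As an illustration of the second rule, here is a minimal sketch of ordering at read time instead of in the data. It assumes the names have already been read from the model in whatever order the store returned them, and it picks case-insensitive alphabetical order purely as an example; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the ordering lives in code, not in the RDF data.
class OrderingAtReadTime {

    // Returns a sorted copy; the input order (as delivered by the store) is irrelevant.
    static List<String> sortedNames(List<String> namesInArbitraryOrder) {
        List<String> sorted = new ArrayList<>(namesInArbitraryOrder);
        sorted.sort(String.CASE_INSENSITIVE_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // A repeated property such as containsPerson gives no ordering guarantee.
        List<String> fromStore = List.of("Leo Simons", "Alberto Reggiori");
        for (String name : sortedNames(fromStore)) {
            System.out.println("  " + name);
        }
    }
}
```

Because the sort is applied by the consumer, the same data can be presented in a different order elsewhere without changing a single triple.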