grack.com

UPDATE: I’ve written a PubSubHubbub-to-XMPP gateway that solves some of the issues of running a real-time feed reader behind a firewall.

UPDATE 2: rssCloud has a serious vulnerability that needs to be addressed in the protocol. I’ve linked some security recommendations here that rssCloud hubs should implement as soon as possible.

These last few months have brought us not one, but two RSS-to-real-time protocols: PubSubHubbub and rssCloud. While rssCloud has been “around” for a while, it never saw much adoption or interest until recently.

As a developer, the important question is: which of these two protocols should I focus on?

When you compare the two protocols technically, you find that there are some similarities (UPDATE: see here for a more in-depth comparison of the APIs):

  • Both PubSubHubbub and rssCloud allow the hub to live on a different server than the one serving the RSS feed. This lets the complexity of both protocols live in a black box somewhere else, managed by someone who cares more about getting the details right.
  • Both offer a fairly simple “ping” notification for publishers. An rssCloud client can make a simple POST request to the specified cloud server, which the server then verifies to ensure that the update was real (alternatively, rssCloud can use XML-RPC or SOAP, neither of which is in fashion right now). PubSubHubbub has a very similar POST operation with very similar semantics.
  • Both offer simple APIs on the hub for subscribing to feeds. PubSubHubbub offers an unsubscribe option, while rssCloud times out subscriptions after 25 hours (the client is expected to re-subscribe after 24).
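
The rssCloud renewal timing from the last bullet can be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, and the 24- and 25-hour values are taken from the description above rather than from any hub's code.

```javascript
const HOUR_MS = 60 * 60 * 1000;
const RENEW_AFTER_MS = 24 * HOUR_MS;   // the client re-subscribes after 24 hours
const EXPIRES_AFTER_MS = 25 * HOUR_MS; // the hub drops the subscription after 25 hours

// Hypothetical helper: when should the subscriber send its next request?
function nextRenewalTime(subscribedAtMs) {
  return subscribedAtMs + RENEW_AFTER_MS;
}

// Hypothetical helper: would the hub still deliver pings at time nowMs?
function isSubscriptionLive(subscribedAtMs, nowMs) {
  return nowMs - subscribedAtMs < EXPIRES_AFTER_MS;
}

const subscribedAt = Date.UTC(2009, 8, 10, 12, 0, 0);
console.log(new Date(nextRenewalTime(subscribedAt)).toISOString());
// 2009-09-11T12:00:00.000Z
console.log(isSubscriptionLive(subscribedAt, subscribedAt + 24.5 * HOUR_MS)); // true
```

The one-hour gap between the two constants gives a subscriber some slack if its renewal request is late.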

There are some significant differences between the two protocols, however:

  • PubSubHubbub supports RSS and Atom out of the box. rssCloud does not support Atom right now, as no one has defined how it would look inside an Atom feed.
  • PubSubHubbub provides “fat pings” to clients, while rssCloud only provides basic notification updates. A PubSubHubbub subscriber can keep tabs on a feed entirely through the ping notifications, allowing it to skip polling of any feed that supports the update protocol. rssCloud requires the subscriber to re-poll the feed after receiving a ping. The “fat ping” has the advantage of saving the feed publisher bandwidth, since clients aren’t downloading the same repeated feed entries time after time, and potentially CPU cycles, since the feed publisher only has to generate a single feed output for the hub rather than for all of its clients (this can be mitigated by caching the generated feed). The fat ping requires more work on the part of the hub, however, as it needs to detect which parts of the feed have changed and push those parts into the subscriber notification dispatch queue.
  • PubSubHubbub lets you subscribe any endpoint you like (with some intelligence to prevent you spamming pings to arbitrary hosts). rssCloud infers your endpoint hostname from the IP address of the request, requiring your subscription logic to live on the same servers as your ping endpoints.
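
To make the bandwidth argument for fat pings concrete, here is a toy model that counts how often the publisher's feed gets fetched for a single update. It is a sketch under simplifying assumptions: the function names are made up, nothing touches the network, and it ignores any fetch an rssCloud hub might make itself to confirm the update.

```javascript
// Counts how many times the publisher's feed is fetched for one update.
// mode is 'fat' (hub pushes the new entries) or 'thin' (hub only signals a change).
function publisherFetchesPerUpdate(subscriberCount, mode) {
  let fetches = 0;
  const fetchFeed = () => {
    fetches += 1;
    return ['<entry>new post</entry>'];
  };

  if (mode === 'fat') {
    // The hub fetches once, then forwards the new entries to every subscriber.
    const entries = fetchFeed();
    for (let i = 0; i < subscriberCount; i++) {
      deliver(entries);
    }
  } else {
    // Every subscriber receives a bare ping and re-polls the feed itself.
    for (let i = 0; i < subscriberCount; i++) {
      deliver(fetchFeed());
    }
  }
  return fetches;
}

// Stand-in for handing entries to a subscriber; a real one would parse them.
function deliver(entries) {
  return entries.length;
}

console.log(publisherFetchesPerUpdate(1000, 'fat'));  // 1
console.log(publisherFetchesPerUpdate(1000, 'thin')); // 1000
```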

Back to the question: which of these protocols should I focus on? The answer probably depends on what you are doing.

  • If you are a publisher that publishes both RSS and Atom feeds, it’s trivial for you to support pinging rssCloud and PubSubHubbub hubs. There’s nothing stopping you from doing it now - just figure out which hubs to use. If you use FeedBurner and PingShot, Google has already cloud-enabled your blog for you. If you want to control your own hub, you’ll probably want to pick an off-the-shelf one. PubSubHubbub is likely the best choice here as it both saves you bandwidth and gets you real-time support in FriendFeed.
  • If you are planning on writing a hub, you’ll probably want to start with rssCloud. Its implementation will be simpler than PubSubHubbub as all it does is redistribute ping notifications.
  • If you are a feed reader or a content spider, you’ll probably have to implement both. I believe that PubSubHubbub gives you the biggest bang for the buck now, as it’s supported by nearly all of the Google feed properties: FeedBurner (the Atom/RSS intermediary choice for a significant number of self-hosted blogs), Blogger (millions of blogs) and Google Reader feeds. It’s also supported by LiveJournal (which lists 20+ million blogs on its homepage).  rssCloud is fairly new, but it managed to score a big integration with wordpress.com (7.5 million blogs, according to their own blog). Unfortunately, as not all of the big sites have implemented both, you’ll have to deal with two technologies for the time being.

After researching both of the technologies in-depth, I’d say that PubSubHubbub is the better technology overall.  While more complex to implement for hubs, it offers far more to feed readers and publishers in terms of bandwidth savings and real-time updates.  For companies doing content analysis, PubSubHubbub is a huge win: it brings the power of the Twitter firehose to RSS. No matter which technology you choose, however, you’ll be getting your RSS feed updates far more often.  It might even allow the next real-time technology to be built on an open XML feed rather than a proprietary company’s servers.


I’ve updated my prototype to use the excellent Rome feed parser library. Instead of dumping 20kB of ‘useful’ raw feed on you, it now formats the entries nicely.

I’ve hooked it up to deliver me real-time headlines from my Google Reader feed and from TechCrunch, both of which work flawlessly.

With all the building blocks I’ve strung together, this really wasn’t any work at all. All of the complexity lies in the cloud: Google’s AppEngine and XMPP implementation and the PubSubHubbub hub. The rest is done with a feed-parsing library.

Here’s the new code:

package com.grack.pubsubhubbub.xmpp;

import java.io.IOException;
import java.util.List;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.google.appengine.api.xmpp.JID;
import com.google.appengine.api.xmpp.MessageBuilder;
import com.google.appengine.api.xmpp.XMPPService;
import com.google.appengine.api.xmpp.XMPPServiceFactory;
import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.FeedException;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

@SuppressWarnings("serial")
public class Subscribe extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setStatus(204);
        XMPPService xmpp = XMPPServiceFactory.getXMPPService();
        JID jid = new JID(req.getPathInfo().substring(1));

        SyndFeedInput input = new SyndFeedInput();
        SyndFeed feed;
        try {
            feed = input.build(new XmlReader(req.getInputStream()));
        } catch (IllegalArgumentException e) {
            throw new ServletException(e);
        } catch (FeedException e) {
            xmpp.sendMessage(new MessageBuilder().withBody(
                    "Feed exception: " + e.toString()).withRecipientJids(jid)
                    .build());
            throw new ServletException(e);
        }

        // Rome returns a raw List, so the cast to List<SyndEntry> is unchecked.
        @SuppressWarnings("unchecked")
        List<SyndEntry> entries = feed.getEntries();

        StringBuilder message = new StringBuilder("Got update: \n");
        for (SyndEntry entry : entries) {
            message.append(entry.getTitle()).append(": ").append(
                    entry.getLink()).append('\n');
        }
        xmpp.sendMessage(new MessageBuilder().withBody(message.toString())
                .withRecipientJids(jid).build());
    }

    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setStatus(200);
        resp.setContentType("text/plain");

        XMPPService xmpp = XMPPServiceFactory.getXMPPService();
        JID jid = new JID(req.getPathInfo().substring(1));

        // Constant-first comparison avoids a NullPointerException when hub.mode is missing.
        if ("subscribe".equals(req.getParameter("hub.mode")))
            xmpp.sendMessage(new MessageBuilder().withBody(
                    "Subscribing to " + req.getParameter("hub.topic"))
                    .withRecipientJids(jid).build());
        else
            xmpp.sendMessage(new MessageBuilder().withBody(
                    "Unsubscribing from " + req.getParameter("hub.topic"))
                    .withRecipientJids(jid).build());

        resp.getOutputStream().print(req.getParameter("hub.challenge"));
        resp.getOutputStream().flush();
    }
}

At first glance, both rssCloud and PubSubHubbub have an interesting shortcoming that makes them difficult to use for desktop feed readers. Since both of them require HTTP callbacks to a publicly accessible endpoint, a user is required to open up a port on their firewall.

It turns out that a subtle difference in the specifications gives PubSubHubbub a big edge in this case. While rssCloud requires your callback endpoint to live at the IP address you make your request from, PubSubHubbub allows you to subscribe any endpoint you wish by specifying a hub.callback url.

So how do we turn this into a real-time feed for desktop clients? Simple: we implement a PubSubHubbub subscriber on a publicly-available, always-on server that receives PubSubHubbub update events and wraps them in XMPP. The XMPP events are transmitted to the desktop client, where it can then process them as if it received the callbacks directly.
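
As a rough sketch of the subscriber side, this is how the form body for a subscription request pointing at such a gateway might be assembled. The helper name and the example.com URLs are hypothetical. Only the hub.mode, hub.topic and hub.callback parameters discussed in this post are included, and a real hub may require more. The XMPP ID goes into the callback path because that is where the gateway code in this post reads it from.

```javascript
// Hypothetical helper: builds the form body a subscriber would POST to a hub.
function buildSubscribeBody(topicUrl, gatewayBase, jid) {
  // The gateway reads the XMPP ID from the path portion of the callback URL.
  const callback = gatewayBase.replace(/\/+$/, '') + '/' + encodeURIComponent(jid);

  const params = new URLSearchParams();
  params.set('hub.mode', 'subscribe');
  params.set('hub.topic', topicUrl);
  params.set('hub.callback', callback);
  return params.toString();
}

const body = buildSubscribeBody(
  'http://example.com/feed.atom',
  'http://gateway.example.com/subscribe/',
  'me@example.com'
);

// Decoding the body again shows the callback the hub would call.
console.log(new URLSearchParams(body).get('hub.callback'));
// http://gateway.example.com/subscribe/me%40example.com
```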

The server application doesn’t need to be smart. Only the “subscribe” and “publish” modes of PubSubHubbub’s protocol are required. All it needs to do is route the update subscriptions to the correct XMPP account. In fact, with Google AppEngine’s new XMPP support, you can do this in a few dozen lines of code, as I’ve done here:

A PubSubHubbub to XMPP gateway, hosted on Google AppEngine

Try out the gateway by entering your XMPP ID on the main page. This will give you a callback URL that you can use on Google’s main PubSubHubbub hub. Enter the URL for any PubSubHubbub-enabled feed as the topic.

The code is simple, though not very robust:

@SuppressWarnings("serial")
public class Subscribe extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setStatus(204);
        XMPPService xmpp = XMPPServiceFactory.getXMPPService();
        JID jid = new JID(req.getPathInfo().substring(1));

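        // Read at most 10 kB of the update and forward only the bytes actually received.
        byte[] buffer = new byte[10 * 1024];
        int length = Math.max(req.getInputStream().read(buffer), 0);
        xmpp.sendMessage(new MessageBuilder().withBody(
                "Got update: " + new String(buffer, 0, length, "UTF-8"))
                .withRecipientJids(jid).build());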
    }

    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setStatus(200);
        resp.setContentType("text/plain");

        XMPPService xmpp = XMPPServiceFactory.getXMPPService();
        JID jid = new JID(req.getPathInfo().substring(1));

        // Constant-first comparison avoids a NullPointerException when hub.mode is missing.
        if ("subscribe".equals(req.getParameter("hub.mode")))
            xmpp.sendMessage(new MessageBuilder().withBody(
                    "Subscribing to " + req.getParameter("hub.topic"))
                    .withRecipientJids(jid).build());
        else
            xmpp.sendMessage(new MessageBuilder().withBody(
                    "Unsubscribing from " + req.getParameter("hub.topic"))
                    .withRecipientJids(jid).build());

        resp.getOutputStream().print(req.getParameter("hub.challenge"));
        resp.getOutputStream().flush();
    }
}

Postscript: I really hope that PubSubHubbub gets a new name.


I caught the audio of the rssCloud get-together in Berkeley tonight and it was very enlightening.

One of the first points brought up was the problematic subscription API. The subscription API requires that the endpoint live at the same IP address as the system making the subscription request. Dave Winer’s response was basically “we can’t and won’t change the protocol because it’s too widely deployed”. He asked that anyone who wanted to change this fork the protocol. Unfortunately, the lack of flexibility in assigning endpoint URLs makes this a very difficult sell for larger organizations where outgoing and incoming HTTP requests are routed differently. I think this was a big mistake on Dave’s part. There’s a great opportunity to fix the glaring holes in the protocol (those 7.5 million blogs on WordPress run the same WP rssCloud plugin that I do).

Some other interesting points were brought up, such as the lack of a block button (useless in a distributed web) and ideas for distributed identity. Tunneling was quickly brought up, but the discussion moved on just as quickly.

At one point, someone asked about rssCloud support in Atom. Dave Winer suggested using a namespace for the element and some discussion took place on it. I’m not sure who brought it up, but it will likely be blogged and Dave will point to it to make it official. No one brought up the already-specified <link rel= tag as an alternative, unfortunately.

Another interesting item brought up was that WordPress will be supporting PubSubHubbub as well at some point in the future. It was more convenient to support rssCloud first, so they went with it. Dave Winer and Matt Mullenweg joked that when they do, “just don’t say rsscloud is dead”.

More notes are available from @susiewee’s live twittering.


Copy of an email I sent to http://tech.groups.yahoo.com/group/rss-cloud/messages:

I’ve been looking into the security of rssCloud over the last few days and a number of serious issues have come up. The ones of immediate importance have been reported and fixed, but I’d like to suggest some protocol changes to prevent rssCloud servers from DoS’ing sites or ending up DoS’ing each other.

Here are my implementor guidelines, based on the research over the last few days. These guidelines should mitigate all of the problems I’ve found so far without requiring rssCloud subscribers to make major changes to their code:

  1. An rssCloud implementation MUST validate that the Content-Type of the POST is application/x-www-form-urlencoded.
  2. All rssCloud parameters MUST be read from the body of the HTTP POST (i.e., $_POST in PHP or the equivalent in other languages). Parameters in the querystring must be ignored.
  3. The path parameter for the callback path MUST begin with a /.
  4. The port parameter, if specified, must be converted to an integer value before constructing the callback URL with it. Any trailing non-digit characters must cause the hub to return a 5xx error.
  5. The subscription callback and all subscription pings MUST include a “challenge” parameter (inspired by PubSubHubbub). The subscriber MUST reply with a response whose body consists of exactly the contents of that challenge parameter. No additional information may appear in the body.
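
Guidelines 1 through 4 could look something like the following on the hub side. This is a sketch rather than code from any existing hub: the function name is hypothetical, the default of port 80 and the range check are my own assumptions, and the callback host is taken from the request address as the current protocol does.

```javascript
// Hypothetical hub-side check for an rssCloud subscription request.
// remoteAddr is the address the request came from; contentType and rawBody
// come straight from the HTTP POST. Query-string parameters are never read.
function buildCallbackUrl(remoteAddr, contentType, rawBody) {
  // Guideline 1: only accept form-encoded POST bodies.
  const mediaType = (contentType || '').split(';')[0].trim().toLowerCase();
  if (mediaType !== 'application/x-www-form-urlencoded') {
    throw new Error('unsupported Content-Type');
  }

  // Guideline 2: parameters come from the body only.
  const params = new URLSearchParams(rawBody);

  // Guideline 3: the callback path must start with a slash.
  const path = params.get('path') || '';
  if (!path.startsWith('/')) {
    throw new Error('path must begin with /');
  }

  // Guideline 4: the port must be all digits (assumed default: 80 when absent).
  const rawPort = params.get('port');
  let port = 80;
  if (rawPort !== null) {
    if (!/^[0-9]+$/.test(rawPort)) throw new Error('port must be an integer');
    port = parseInt(rawPort, 10);
    if (port < 1 || port > 65535) throw new Error('port out of range');
  }

  return 'http://' + remoteAddr + ':' + port + path;
}

console.log(buildCallbackUrl(
  '192.0.2.10',
  'application/x-www-form-urlencoded; charset=UTF-8',
  'port=8080&path=%2Fnotify'
));
// http://192.0.2.10:8080/notify
```

A hub would translate the thrown errors into an error response instead of queuing the subscription.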

I haven’t taken a look at the SOAP/XML-RPC parts of the protocol yet, but the extra framing should help make them more secure. Someone more familiar with those technologies should run those through their paces.

Feel free to contact me with any rssCloud hub implementation you might have and I’ll run my battery of tests against it.


UPDATE: There’s a new domain parameter in rssCloud that makes this DDoS far, far worse.  Since there’s no verification (yet) on rssCloud endpoints, you can now subscribe any server to any rssCloud hub’s notifications.

While researching some of the issues of rssCloud running in a shared hosting environment, I came across a serious vulnerability in the protocol. The vulnerability allows someone to cripple a shared web host. Because of the sensitive nature of this vulnerability, I’m not going to share example code or which shared host(s) are vulnerable.  The fix is easy: follow these security recommendations to close the hole.

The inspiration for this vulnerability came from Nick Lothian’s post on FriendFeed. It turns out that many shared hosting providers route incoming and outgoing HTTP requests through different IP addresses. This routing is usually done transparently by networking gear outside of the web servers themselves.

rssCloud’s specification infers the endpoint from the REMOTE_ADDR CGI variable at the time of the subscription. It would be very difficult to get an rssCloud subscriber working in a shared hosting environment because every subscription request you make goes out on IP address A, but all of your incoming requests come in via port 80 on IP address B. For some shared web providers, the machines that make outgoing requests are also web servers, serving banner messages or redirects to sales sites. Because they are web servers, they are considered valid rssCloud REST endpoints (returning 200 OK for POST requests on some URLs).

When you put these pieces together, it becomes readily apparent that you can now subscribe your shared host’s outgoing HTTP request IP address to any number of feeds. Considering that Wordpress has 7.5 million blogs that speak rssCloud, there’s a significant number of blogs that could end up pinging the machine.

There are probably a number of other interesting vulnerabilities in this area, such as traffic that travels through a proxy or an anonymizing service such as Tor. It may be possible to knock one of these offline by subscribing it to a large number of feeds.

The problem with rssCloud is that its subscription request only proves that you can make requests via the given IP address, not that the given IP address is willing to receive them. By adding the challenge parameter I suggested in the previous post, you can now guarantee that the endpoint is willing to receive these requests, making it much harder to subscribe an unwilling participant in the protocol.
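
The challenge check described above might be sketched like this on the hub side. The function name and the injected postToEndpoint callback are hypothetical stand-ins for whatever HTTP client a hub uses, and the call is kept synchronous only to keep the example short.

```javascript
const { randomBytes } = require('crypto');

// Hypothetical hub-side verification. postToEndpoint(url, params) stands in
// for the hub's HTTP client and must return the response body as a string.
function endpointAcceptsSubscription(callbackUrl, postToEndpoint) {
  // A fresh, unguessable challenge for every verification attempt.
  const challenge = randomBytes(16).toString('hex');
  const responseBody = postToEndpoint(callbackUrl, { challenge: challenge });
  // Only an endpoint that echoes exactly the challenge is considered willing.
  return responseBody === challenge;
}

// A willing subscriber echoes the challenge back...
const willing = (url, params) => params.challenge;
// ...while an unrelated web server just returns its usual page.
const unwilling = (url, params) => '<html>Welcome to our hosting service</html>';

console.log(endpointAcceptsSubscription('http://192.0.2.10/notify', willing)); // true
console.log(endpointAcceptsSubscription('http://192.0.2.20/', unwilling));     // false
```

A server that merely answers 200 OK to any POST no longer qualifies as an endpoint, because it cannot know the challenge value.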


Robert Scoble’s accidental tweet (“Frnégtttrdre”) earlier tonight caused a minor ripple: people wondering if he was announcing a secret project, under the influence of alcohol or wandering around with an unlocked iPhone in his back pocket.

It also makes for an interesting test for real-time search.

An hour after his tweet:

Anyone else I’ve missed?

Conclusions: If you tweet a random word, there’s no guarantee that you’ll get indexed right away.  Additionally, not every service tests unicode querystring parameters.
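
For reference, this is what the search term looks like once it is percent-encoded as UTF-8 in a querystring. The last line shows one common way such a parameter gets mangled, namely decoding the bytes as Latin-1. That failure mode is my assumption, not something I verified against any of the services.

```javascript
const query = 'Frn\u00e9gtttrdre'; // the tweet, with é written as an escape

const encoded = encodeURIComponent(query);
console.log(encoded); // Frn%C3%A9gtttrdre

// Decoding as UTF-8 round-trips cleanly.
console.log(decodeURIComponent(encoded) === query); // true

// Assumed failure mode: the same two bytes read as Latin-1 characters.
console.log(Buffer.from(query, 'utf8').toString('latin1')); // FrnÃ©gtttrdre
```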


Is it time for the world to move on from RSS to its successor, Atom? Some considerations:

Atom has an IETF standard for syndication. Atom has an IETF standard for publication. Atom was designed for modularity. Atom supports rich, well-defined activities within feeds.

RSS is effectively frozen at 2.0:

RSS is by no means a perfect format, but it is very popular and widely supported. Having a settled spec is something RSS has needed for a long time. The purpose of this work is to help it become a unchanging thing, to foster growth in the market that is developing around it, and to clear the path for innovation in new syndication formats. Therefore, the RSS spec is, for all practical purposes, frozen at version 2.0.1. We anticipate possible 2.0.2 or 2.0.3 versions, etc. only for the purpose of clarifying the specification, not for adding new features to the format. Subsequent work should happen in modules, using namespaces, and in completely new syndication formats, with new names.

It is full of legacy tags and archaic design decisions:

The purpose of the <textInput> element is something of a mystery. You can use it to specify a search engine box. Or to allow a reader to provide feedback. Most aggregators ignore it.

We are spending all this time duplicating effort. Every feed reader needs to deal with Atom and RSS. Every blog provides an Atom feed and an RSS feed. Users trying to subscribe to blog feeds are presented with an unnecessary choice.

RSS solved a need at the time, even though it was crufty and difficult to use and difficult to parse (remember when RSS XML didn’t have to be well-formed XML?). It served as an inspiration for millions of sites to open up their content to new methods of reading. It inspired a great successor, Atom, which has surpassed it many times over.

We dropped gopher when its time ran out. It’s time to make Atom the primary format for blogs.


JavaScript has three primitive types that you commonly convert between: number, string and boolean. You can quickly coerce values between them using some simple expressions.
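
As a quick illustration of the kind of expression meant here, the three basic coercions can be wrapped in small functions. The function names are just for this example.

```javascript
const toNumber = (x) => +x;    // unary plus
const toText = (x) => '' + x;  // concatenation with an empty string
const toBoolean = (x) => !!x;  // double negation

console.log(toNumber('42'));        // 42
console.log(typeof toText(42));     // string
console.log(toBoolean('0'));        // true (any non-empty string is truthy)
console.log(toBoolean(+'0'));       // false (the number 0 is falsy)
console.log(toNumber('abc') || 0);  // 0 (NaN falls back to 0)
```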

There are a few different coercion expressions, depending on how you want to handle some of the corner cases. I’ve automatically generated a list below:

Conversion:  To Number  To Number  To Number  To String          To Boolean  To Boolean
Expression:  +x         (+x)||0    +(x||0)    ""+x               !!x         !!+x
null         0          0          0          "null"             false       false
(void 0)     NaN        0          0          "undefined"        false       false
NaN          NaN        0          0          "NaN"              false       false
Infinity     Infinity   Infinity   Infinity   "Infinity"         true        true
-Infinity    -Infinity  -Infinity  -Infinity  "-Infinity"        true        true
0            0          0          0          "0"                false       false
"0"          0          0          0          "0"                true        false
1            1          1          1          "1"                true        true
"1"          1          1          1          "1"                true        true
2            2          2          2          "2"                true        true
"2"          2          2          2          "2"                true        true
[]           0          0          0          ""                 true        false
({})         NaN        0          NaN        "[object Object]"  true        false
true         1          1          1          "true"             true        true
"true"       NaN        0          NaN        "true"             true        false
false        0          0          0          "false"            false       false
"false"      NaN        0          NaN        "false"            true        false
""           0          0          0          ""                 false       false
"null"       NaN        0          NaN        "null"             true        false

The above table was generated with this code (note: uses some Firefox-specific code).

<table id="results" style="border-collapse: collapse; border: 1px solid black;">
 <tr id="header">
 <th>Conversion:</th>
 </tr>
 <tr id="header2">
 <th>Expression:</th>
 </tr>
</table>

<script>
function styleCell(cell) {
 cell.style.border = '1px solid black';
 cell.style.padding = '0.2em';
 return cell;
}

values = [
null, undefined, NaN, +Infinity, -Infinity, 0, "0", 1, "1", 2, "2",
   [], {}, true, "true", false, "false", "", "null"
]

coersions = [
["To Number", "+x"],
 ["To Number", "(+x)||0"],
 ["To Number", "+(x||0)"],
 ["To String", "\"\"+x"],
 ["To Boolean", "!!x"],
 ["To Boolean", "!!+x"]
]

var results = document.getElementById('results');
var trHeader = document.getElementById('header');
var trHeader2 = document.getElementById('header2');

for (var i = 0; i < coersions.length; i++) {
 var th = trHeader.appendChild(styleCell(document.createElement('th')));
 th.textContent = coersions[i][0]
 th = trHeader2.appendChild(styleCell(document.createElement('th')));
 th.textContent = coersions[i][1]
}

for (var i = 0; i < values.length; i++) {
 var tr = results.appendChild(document.createElement('tr'));
 var rowHeader = tr.appendChild(styleCell(document.createElement('th')));
 rowHeader.textContent = uneval(values[i]);

 for (var j = 0; j < coersions.length; j++) {
 var td = tr.appendChild(styleCell(document.createElement('td')));
 td.textContent = uneval(eval("(function(x) { return "+coersions[j][1]+"})")(values[i]));
 }
}

</script>