IA Summit 2015 Main Conference Talk
Topic(s): coding and content modeling
The Gospel of structured content has taken the web publishing world by storm, but all is not well. Pressured by the demands of content reuse in a multi-device world, even lightweight blogging tools now leverage carefully modeled content types with explicit fields and schemas. Unfortunately, it all falls apart when users hit the body field: ugly ad-hoc markup creeps in, house styles evolve without planning, and critical metadata stays locked in blobs of “good enough for now” HTML.
Better HTML-focused WYSIWYG tools aren’t enough, and the principles of semantic HTML don’t solve the deeper problem. The work of content modeling must extend inside the body field, not just wrap around it.
In this session, we’ll discuss the projects where this issue is most frequently encountered, see how three different CMSs are tackling the problem, and learn how to apply the lessons of XML and DITA with modern database-driven web CMSs.
About the speaker(s)
Jeff Eaton is Senior Digital Strategist at Lullabot. Autodidactic intersectionalist, content strategy ingenue, software architecture ne’er-do-well, and generally opinionated snark.
Jeff Eaton: Hello! I like the title for this because the body field is something that’s incredibly mundane in most projects and often ignored. I like the idea of it being the staging ground for some sort of pitched battle for the future of content.
What this is about is basically the tug of war between the need for smooth narrative flow, the stuff that a writer can create when they sit down and want to create a narrative about something, and the need for chunky, future-friendly, well-modeled structure that there is so much pressure to produce on modern content-oriented projects.
If you saw Bram Wessel’s talk in the previous slot about micro-content, this is pretty much exactly the opposite. This is about macro-content. It’s about the stuff you can’t just break up into little tiny pieces.
My name is Jeff Eaton. I’m @eaton on Twitter if you want to hear me ranting about this stuff very frequently, at all hours. I’m with a company called Lullabot. We do what I think of as big web content.
We work with organizations where content isn’t just an adjunct to what they do or a way to promote their services or something like that. It’s actually the heart of what they do. Martha Stewart Living Online, they produce magazines, websites, and recipes. Their content is who they are.
World Wrestling Entertainment, it turns out, is also a content company. They produce stories, narratives, long arcs of conflict and stuff like that. That’s what their entire business revolves around.
We just re-launched MSNBC. It’s another one of those organizations that even though you may not think about it, what they do is content, 100 percent. Everything else is just supporting that.
What we find on most of these projects is that for content-centric organizations, there’s an increasing pressure to deliver these four things. One is greater flexibility.
Everybody’s talking about responsive and multichannel and device proliferation, stuff like that. The content that everyone is working with and producing needs to be flexible enough to live in all of those places without having to reinvest in reproducing it for each one of those channels.
The second thing that’s needed is more efficient reuse across different campaigns, across time. A story that may have been perfect two years ago needs to be flexible enough to be folded into a new series as adjunct information, stuff like that. The ability to use things across time spans needs to be there.
Also, faster production. A lot of the cool things that people do with content take lots of time. Some of that is nonnegotiable. For a news organization, a reporter going out and researching a story is not something that you can trim down with changes to the CMS, but you can make tools that make the actual production of it faster. There’s more and more pressure to support that stuff.
That also means that anything that starts shifting more and more work onto the actual editorial team and the actual writing team is pretty bad. You can’t just toss more complexity onto them and hope that they’ll sort it out.
Finally, this is the kicker that keeps tripping us up. Everyone wants richer narratives. Who’s heard that phrase in a context where the person saying it actually knew what they meant by richer narratives?
That demand for stories that are more than just text dumped into a big long stretch of a web page is driving a lot of interesting advances and headaches and late nights in web development teams and editorial teams.
Anyone recognize this? It’s a screenshot from the New York Times’ “Snow Fall: The Avalanche at Tunnel Creek,” a news story that made such waves in the content publishing and CMS world that “snowfalling” became a verb. Three years later, we still hear clients saying, “You know, like Snow Fall.”
One of the reasons that it’s frustrating to hear that brought up is that the Snow Fall article was something like a dozen different major chapters, with all sorts of rich media woven in throughout the story. You can see these sorts of things sprinkled throughout the key points in the narrative. It was ridiculously expensive to produce.
Tons and tons of actual production resources went into interviews and infographics and stuff like that. When somebody says, “You know, like Snow Fall,” it’s like, “Well, do you have a quarter of a million dollars to fund that story?” Maybe? Then let’s do that.
Beyond just the resource-intensive production of that particular news story, a lot of the interesting things it did presentationally, as a part of that narrative, break the chunky CMS models that a lot of us had become really accustomed to building on top of, to meet those other three requirements of easier production, faster production, more flexibility.
The goal is creating these interesting rich narratives where that picture isn’t just a picture, it’s an embedded photo gallery. That little sidebar with a person’s name isn’t just something that got tossed in there; it’s the first time John is mentioned in the story, so his bio is floated up there. It’s all these things woven into the flow in a way that breaks the models that we tend to use in chunky, well-modeled CMSs.
I’m kidding, but the first thing to accept is this is currently, today, a hard problem with a lot of the tools that we have. The good news is it’s not the end.
I want to do this a little bit backwards because the solutions to this are actually shockingly simple and well-established in the general world of content modeling in IA, in content management, but we’re not terribly used to them because they’ve fallen out of favor over the past few years in web publishing systems.
Explaining why the things that we’re used to don’t work takes probably 10 times longer than just saying what does. We’re going to start by saying, “What does work?” Then we’re going to backtrack through some of the reasons why the simpler, much more popular solutions right now don’t work as well.
Here is the solution to this problem of embedding complex structure inside of narrative flow effectively. The first step is to use placeholders inside of those narratives instead of either forcing or allowing editorial teams to start pushing more and more blobby, heavy markup inside of their stories.
Instead of allowing that, have placeholders that they can put in there to say, “Put gallery 15 here. Put Dave’s bio in here.” That allows you to capture important information about where in a narrative something occurs without forcing all of the heavy structure that goes along with the markup to represent it into the same narrative.
The second thing is to accept that there’s always going to be a transformative step for well-modeled content. One of the things that we’ve gotten used to in the predominantly web publishing-driven side of the CMS world is that the body field stores HTML.
Accepting that well-modeled content will always need a transformative step, to go from the language that people enter stuff in to however we want to represent it on the front end, saves you a lot of time and a lot of hassle. It frees you from having to put all kinds of complex, fundamentally design-oriented tools in the hands of an editor who just wants to capture something like, “This needs to go here,” or, “I mean this.”
Let them capture the meaning then transform it into a representation on the output.
Third, don’t rely on HTML for that. I hinted at that when talking about the transformative step, but HTML is the language of web browsers. It is awesome for communicating to a web browser what that web browser should do. It is not great for capturing the semantics of your organization’s vocabulary for talking about stories and narrative and content.
There’s a big mismatch there, and trying to always force people to talk about it in HTML terms causes a lot of pain.
Finally, and again this is hinted at by the not-relying-on-HTML thing, we need to work hard to clarify the vocabulary of our content inside of each organizational project.
What are the nouns and adjectives that are meaningful to us, that represent things that we do inside of stories, things that are common patterns that we always lean on when communicating certain things? Those are the language and the vocabulary of your content, in the same way that the entity relationship diagram and the boxes and arrows and stuff like that represent big chunky content types.
Inside of the body field, there’s just as much of narrative language that can be represented.
It isn’t just on giant big-budget Snow Fall projects that we’re seeing this problem, either. This is a review of the Apple Watch from theverge.com. It’s rich with all kinds of crazy stuff that is woven into the flow of the article, but doesn’t necessarily have meaning in terms of, “What would I enter into the image-that-goes-at-the-top field? What would I enter into the sidebar?” It’s all very custom and they invested a lot in building that.
This is a review of the Xbox One from Polygon. It’s very image rich, but it’s part of a long flow. It’s not just some sort of one off big HTML file that they produced.
Going to more mundane topics, well, the pope isn’t a mundane topic compared to the Xbox, but this is an article from CNBC. It’s a simpler example, but up there you can see this little captioned image actually has four or five different little chunks of data inside of what could otherwise be a simple captioned image.
The fact that it appears at a particular point in the story is important. It’s very easy to start turning that into an HTML editing task for an editor rather than a simple, “I need the image here.”
This is a CNBC article. It has a fairly normal editorial flow, but there are all of these individual pieces in there, from “read more” links at particular points in the narrative to a pull quote that actually has five distinct pieces of information to track: who, where, what their job is, who said it, stuff like that.
All of those things are part of their article structure and their article narrative, but they have to live in the body field because the positioning and the flow of it is very important.
That’s really where we face this problem, where these three things occur simultaneously. There is narrative text, like reports, documentation, and long-form news. It has islands of structure inside of it, like galleries, pop-up info, and data visualizations that illustrate a particular statement. And the placement matters.
You can’t just have a pile of data visualizations at the end or you can’t rely on templating to say, “Oh, at the beginning of the article, place one of the images here.” They actually need to place very specific things at very specific places.
In the classic pick-two style, you can choose any two of these and the implementation stays simple. Whenever you need all three of them, things get painful.
Why are placeholders, versus lots of inline markup, important?
In current CMSs, we’ve solved tons of problems by taking the big blobby HTML document and turning it into nicely chunked-up relational database storage with things like title, summary, byline, body, the images for the gallery. All of it is broken out into separate things that we can track individually. Then we push those things through templates in order to turn them into an actual page. That’s awesome.
The problem is we’ve only really kicked the can down the road. We’ve taken blobbiness and we’ve squeezed it into the summary field and the body field. Everything else, we’ve modeled nicely, but we sort of left the body field as a no man’s land, hoped for the best, and scolded people when they put too much weird stuff in there. But what are you going to do?
The problem comes when you’ve got something like the photo gallery that doesn’t just need to be an adjunct to the post, but it needs to actually live right there for it to have its meaning. That is full of sense.
This is where the placeholder stuff comes in. Instead of actually putting the full markup for say, a photo gallery or something like that in the article, you model that photo gallery as you would in a fully-chunked out system.
You have upload fields and reordering draggable widgets and stuff like that, but the editorial teams get little placeholder tokens like a gallery to put inside of the flow. Then when it’s transformed, those things can be replaced by the actual final representations.
Now, the actual syntax of those little placeholder tokens is way less important than the concept, that what you’re capturing is enough meaning to turn it into what you will eventually want. The top one is a simple custom XML. The middle one is a WordPress style shortcode. The bottom one is HTML5 data attributes.
All great, but the idea is they become those placeholders and they allow things to be controlled inside of the flow without having to do Dreamweaver in a body field.
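As a rough illustration of why the syntax matters less than the concept, here is a minimal Python sketch that reads three spellings of the same placeholder into one token. The exact markup on the slide isn’t reproduced in this transcript, so the three formats below (the `id` attribute, the `data-type` and `data-id` attribute names) and the `Placeholder` and `parse_placeholder` names are assumptions made up for this example.

```python
import re
from typing import NamedTuple, Optional


class Placeholder(NamedTuple):
    kind: str  # what the editor means, e.g. "gallery"
    ref: str   # ID of the separately modeled item


# Hypothetical spellings of one placeholder; none is taken from a real CMS.
PATTERNS = [
    # Custom XML:            <gallery id="15" />
    re.compile(r'<(?P<kind>[a-z]+)\s+id="(?P<ref>[^"]+)"\s*/>'),
    # Shortcode style:       [gallery id="15"]
    re.compile(r'\[(?P<kind>[a-z]+)\s+id="(?P<ref>[^"]+)"\]'),
    # HTML5 data attributes: <div data-type="gallery" data-id="15"></div>
    re.compile(
        r'<div\s+data-type="(?P<kind>[a-z]+)"\s+data-id="(?P<ref>[^"]+)"\s*></div>'
    ),
]


def parse_placeholder(text: str) -> Optional[Placeholder]:
    """Return the token a placeholder string stands for, or None if it is not one."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(text.strip())
        if match:
            return Placeholder(match.group("kind"), match.group("ref"))
    return None
```

All three spellings come out as the same `Placeholder("gallery", "15")`, which is the point: what gets stored is the meaning (“gallery 15 goes here”), not its markup.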
The transformation thing is important because one of the things that we often hear is, “Well, this would be solved if we just gave people a better WYSIWYG editor or it would be solved if we force them to use semantic HTML because isn’t that going to solve all of our problems.” No.
I apologize. I started talking about the next slide. Pause. Hold that lead-up; this one is about the transformation stuff.
The idea that the little placeholder token of gallery ID one gets put in the body field, that’s cool. Now, the transformative step means that in all of the different ways that we end up publishing stuff, we can turn that ID equals one into whatever it needs to be.
On the mobile web, it could just be a title and a link with a little photo gallery icon next to it. On an enhanced full-sized desktop experience, maybe with progressive enhancement or something like that, we could turn it into a full scrolling gallery with the actual images and captions and credits.
In an email version of the story that goes out, it might just be a single image and a caption and a link. A partner API, if someone wants to pull new stories from our site and republish them somewhere, we might strip them out entirely because we don’t know whether the photo rights apply to other people who are republishing.
On a printable PDF, we may just say see page whatever, where that stuff is correlated somewhere else. For our mobile app, that might be just sent out as JSON data with additional photo assets in a separate little pile of information.
The idea is whether we’ve changed the design, whether we need to push it to a new channel, that placeholder token allows us to keep the information where it needs to be but still turn it into any representation we want in there.
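To make that transformative step concrete, here is a small Python sketch, assuming a shortcode-style token and an in-memory gallery store. The `GALLERIES` data, the channel names, and the `transform` and `render_gallery` functions are all hypothetical; a real CMS would load the gallery from its own storage and render through its own templates.

```python
import re
from html import escape

# Hypothetical stand-in for the separately modeled gallery content.
GALLERIES = {
    "1": {
        "title": "Storm photos",
        "url": "/galleries/1",
        "photos": [
            {"src": "/img/a.jpg", "caption": "Downtown"},
            {"src": "/img/b.jpg", "caption": "Harbor"},
        ],
    },
}

# Assumed token syntax: [gallery id="1"]
TOKEN = re.compile(r'\[gallery id="(\d+)"\]')


def render_gallery(gallery: dict, channel: str) -> str:
    """Turn one gallery into whatever a given channel needs."""
    title = escape(gallery["title"])
    if channel == "mobile_web":
        # Just a title and a link.
        return f'<a class="gallery-link" href="{gallery["url"]}">{title}</a>'
    if channel == "desktop":
        # The full gallery with images and captions.
        items = "".join(
            f'<figure><img src="{p["src"]}" alt="">'
            f'<figcaption>{escape(p["caption"])}</figcaption></figure>'
            for p in gallery["photos"]
        )
        return f'<section class="gallery"><h2>{title}</h2>{items}</section>'
    if channel == "email":
        # A single image, a caption, and a link.
        first = gallery["photos"][0]
        return (
            f'<img src="{first["src"]}" alt="">'
            f'<p>{escape(first["caption"])}</p>'
            f'<a href="{gallery["url"]}">{title}</a>'
        )
    if channel == "partner_api":
        # Stripped entirely: photo rights may not extend to republishers.
        return ""
    raise ValueError(f"unknown channel: {channel}")


def transform(body: str, channel: str) -> str:
    """Replace every gallery token in the body with its channel-specific output."""
    def replace(match: re.Match) -> str:
        gallery = GALLERIES.get(match.group(1))
        return render_gallery(gallery, channel) if gallery else ""
    return TOKEN.sub(replace, body)
```

The stored body never changes. Adding a channel or redesigning the gallery means touching `render_gallery`, not every story that contains one.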
As I was saying, don’t rely on HTML. We hear things like semantic HTML or better WYSIWYG editors will solve this problem for us, because they will allow people to do all kinds of rich stuff inside of the body without making it bad blobs, bad HTML.
This is a perfectly semantic HTML representation of a photo gallery. It’s a gallery, a list of photos and a caption, as simple as possible, but that’s still a whole mess of markup. Not only that. That is not a gallery. That is a particular representation of a gallery that may be what we’re using today in the design, but what if we decide galleries in the future actually need five photos?
Even a perfect semantic representation in HTML only captures the current browser representation of something, the structure of it that we’re sending to the browser, not the “Galleriness” of it.
That’s why semantic HTML, while it’s great for pushing things to the browser and ensuring that our front-end designers aren’t constantly fighting crappy, horrible HTML, is not the language of our content.
We need things like: this is a teaser, this is a new chapter, these are related stories, this is an author bio or photo credit, or this is a promoted element. What we have are things like aside, section, an ordered list, paragraphs. They are a representation language for the browser, not the essence of what our content is.
That’s where we get to this idea of clarifying our content vocabulary. This is an organizational challenge in the same way that modeling content requires lots of planning and lots of thinking about what things we’re going to be creating and using.
This is about going into the body field and starting to think about what the narrative elements that we’re using commonly are and naming them and starting to treat them as real things that we keep track of and we respect as important representations of our content.
This is an example of a datapic from the BBC’s site. That’s a particular content element that they started using frequently. It’s a photo, a title, one to five pieces of statistical information and a credit. They have a simple HTML representation of that and they treat it as a unit of information that they can weave into narratives wherever they want. That emerged as one of their narrative nouns.
It’s not an article. It’s always a piece of an article that lives somewhere in it.
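Naming an element like that means it can be written down as a real type. Here is a minimal Python sketch of the photo-plus-statistics element described above. The class and field names are assumptions for illustration, not the BBC’s actual schema; only the shape (a photo, a title, one to five statistics, a credit) comes from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Statistic:
    label: str
    value: str


@dataclass(frozen=True)
class DataPic:
    """A named narrative element: never a whole article, always a piece of one."""
    photo_url: str
    title: str
    statistics: List[Statistic]
    credit: str

    def __post_init__(self) -> None:
        # The element is defined as carrying one to five pieces of statistical information.
        if not 1 <= len(self.statistics) <= 5:
            raise ValueError("a datapic carries one to five statistics")
```

Once the rule lives in the model, editors can’t accidentally create a six-statistic variant, and designers know exactly what they have to lay out.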
The recap. Use placeholders instead of allowing people to blob stuff into the body field when structure is needed. Always accept that there’s going to be a transformative step with those narrative structures. Don’t rely on HTML for it. We can learn a lot from the XML and DITA communities, from the work that they’ve done in modeling semi-structured text and narratives and documentation, stuff like that.
Then start taking on the hard work of clarifying the organization’s content vocabulary inside of narrative. How do they talk about things? What elements do they come back to, to illustrate or send important messages, stuff like that?
This is actually an example of the custom XML that a major news organization uses to store their stories. They do pull quotes with three sub-elements. They track assets separately. Then they have very simple HTML with embed codes like inline asset or company name or a stock price that they use inside of it.
They give their editors a WYSIWYG editor that only lets them use the vocabulary of their content, not full HTML. Then they transform it on output to the stuff that we saw earlier. It’s given them tons of flexibility and it’s worked really well for them.
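One way to enforce “only the vocabulary of their content” is to check a stored body against an allowed list of elements. This is a minimal Python sketch under that assumption. The element names in `ALLOWED` are invented for the example (the organization’s real schema isn’t shown in this transcript), and `unknown_elements` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical content vocabulary: a little simple markup plus the
# organization's own narrative elements. Not a real schema.
ALLOWED = {
    "body", "p", "em", "strong",
    "quote", "inline-asset", "company", "stock-price",
}


def unknown_elements(body_xml: str) -> list:
    """Return the sorted names of elements that are not in the vocabulary."""
    root = ET.fromstring(body_xml)
    return sorted({el.tag for el in root.iter() if el.tag not in ALLOWED})
```

A body that sticks to the vocabulary returns an empty list. One with stray presentational markup, say a `div` or a `font` element, gets those names reported, so it can be rejected or cleaned up before it is saved.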
When it works, it turns into a virtuous cycle. The editorial, development and design teams can have a shared common vocabulary that they use to develop, plan and iterate on the stuff that they’re building.