THE 100 YEAR WEB

Steven Pemberton

Transcript of keynote at Balisage 2018.
Slides at https://homepages.cwi.nl/~steven/Talks/2018/07-31-balisage/
Video at https://www.youtube.com/watch?v=jl4fnY4BjEY

So I was going to tell you something about myself, but it sounds like you've heard most of it. The one thing I would like to mention is that my tutor at university was Dick Grimsdale, who I don't think gets enough attention in this world, because he built the world's first transistorized computer, and he was himself a tutee of Alan Turing. Which makes me a grand-tutee of Alan Turing. [laughter] And quite coincidentally I went on to work at Turing's old department in Manchester, and I worked on the software for computer number 5 in the series that Turing himself worked on. He didn't work on number one; he worked on number two, and I think before his death he got to work on number three. I was also the fifth person on the internet in Europe, and that's because the guy in the office next to me set up the internet in Europe, and he knocked on my door and said "The Internet's up". [laughter] So I logged in to a computer in New York and said "oh yeah, it's working, great, thanks!", and that was the moment, thirty years ago in November, that the internet started running in Europe. This is a typical project meeting; that's Guido van Rossum in the middle, the man who's responsible for Python. That's me on the edge, if you don't recognize me. That's me talking to Tim Berners-Lee. So let me start with an event in 1379 at New College, Oxford. It's called New College because it was only built in the 1300s. You may have heard this story before; it comes from the great Stewart Brand, from his book "How Buildings Learn". New College was built in that year, and this picture is from Victorian times.
Sometime around that time they decided to check those oak beams at the top of the dining hall there, and they discovered that they were rotting, and they needed new oak beams to replace them. Now, the problem is that finding oak beams is difficult. So they thought "What should we do?", and they said "Well, the university's got a forester, let's go to him first". So they went to the forester and said "We need some oak beams", and he said "Which college are you from?", and when they said "New College" he said "Well, I've got your trees." And they said "What do you mean, 'our' trees?" "Well, when they built the college they planted new trees, ready for when you would need them in a few hundred years' time." [laughter] That's the sort of thinking that we don't really see much nowadays. [laughter, applause] So what I'm hoping to do is just plant one of those seeds in our collective minds about the hundred year web. So here we are in 2018. It's 60 years, actually 61 years, since what I consider the start of public computing, when the first municipality in the world installed a computer. It's 50 years this year since the introduction of the programming language Algol 68. You may never have used it, you may not even have heard of it, but if you use the word 'dereference' then Algol 68 has touched your life. [laughter] And in fact I think Algol 68 has had much more effect than you realize: I can trace a direct line from Algol 68 to Python, and show how Python developed partly as a result of Algol 68. It's 30 years, as I just said, since the introduction of the open internet in Europe, when the internet became truly international, and this year is of course 20 years of XML, since the spec got published. So 2018 is an interesting year in lots of ways.
And it's also interesting for me to think that we still consider the internet as something new. It's 30 years old, I would claim; but when the internet was first switched on, computing itself was only 30 years old. So it's surprising, actually, how new everything really is. So, XML: I'm here to praise XML, not to bury it [laughter], and that's actually part of my purpose here; I am here to bury something else, as you will shortly see. XML is great stuff, and the good thing about it is that it shows none of the second-system effect. It was actually a reduction of something that came before, picking out the good stuff. Admittedly there are some small mistakes, but basically it's good solid design, and as I say, I'm not going to complain about it. It's given us a good document model, it's given us an excellent tool chain, and it's given us modularization. Let me just show you something: this is in 2002. I was responsible for - that is, I chaired - the XHTML working group, and in 2002 we were already demonstrating XHTML2 + SVG + MathML together in one document in the browser. This was long before HTML5, whose people seem to think that they were the ones who brought these things together in the browser; that's absolutely not true, and it's thanks to XML, which allows you to combine stuff together in one document in a useful way; it produced the modularization that we needed. A lot of people at that time were complaining that XHTML was adding so little new stuff, so few new elements, and that was because we didn't need to: other people were doing that, producing other namespaces that you could include in XHTML, and you could get fantastic documents. We didn't have to do any work; we let the experts do the stuff on graphics and the experts do the stuff on maths. So now we've got a new web since then. We've got HTML5; it's changed the web. Some parts of it are good, although I ... wouldn't be able to tell which parts at the moment.
[laughter] I'm sure that if I think hard enough I will find something. But mostly it is based on a lack of proper design, and a lack of understanding of design principles and how to design notation. [clapping] Just you wait! [laughter] I don't believe that HTML is leading the web to its full potential. The central element of HTML that made it so successful, I will claim, is that it is "declarative", and we're going to hear this word several times this week. 'Declarative' is where you say what you want, rather than saying how you want to get it. Now, at school we actually learn what numbers are, and how to add them, and how to subtract them, and how to multiply and divide them, but when we get to square roots we're told: the square root of a number n is the number such that when you multiply it by itself it gives you the original number again. Nobody at school learns how to calculate a square root, except to press that button on the calculator, but everybody knows what a square root is. So this definition allows you to recognize a square root, but it doesn't tell you how to calculate it. Now, if I show you the code to calculate a square root: if I hadn't told you that this was square root, probably most of you wouldn't have been able to tell me, because it's actually complicated code. I actually know the theory behind this, and I find it very hard to pluck the theory out of it, to see how it matches the theory, because it's been optimized a bit and lots of bits have been taken away, so that the theory part has vanished. So: what does it do? Under what conditions does it do it? How does it do it? What's the theory? Is it correct? Can I prove it? Can I change anything here? What do the different things mean? It's hard code, and this is why procedural code is so hard: because it doesn't match up with the definition. You have to translate the problem into the solution, and there's no link between the two.
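[The square-root code on the slide isn't reproduced in this transcript, but a typical procedural implementation, sketched here using Newton's method, makes the point: nothing in the code itself says "square root", only the declarative definition does.]

```javascript
// A sketch of a procedural square root (Newton's method), not the
// code from the slide. Start from a guess and repeatedly average it
// with n/guess; each step roughly doubles the correct digits.
function sqrt(n) {
  if (n < 0) throw new RangeError("negative input");
  if (n === 0) return 0;
  let guess = n;
  let prev;
  do {
    prev = guess;
    guess = (guess + n / guess) / 2; // one Newton iteration
  } while (Math.abs(guess - prev) > 1e-12 * guess);
  return guess;
}
```

Try reading it without the function name: the loop, the averaging, the tolerance test give no hint of the one-line definition "the number which, multiplied by itself, gives n".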
So the great thing about declarative is that:
* it's much shorter
* it's easier to understand
* it's independent of the implementation
* it's less likely to contain errors
* it's easier to see that it's correct
* and it's tractable.
So here is an example of how you would define numbers, just the concept of numbers. A number is an optional sign followed by one or more digits; an optional sign is an optional minus sign; a digit is those things; and a number just has its normal everyday meaning. Now, this is short, it's easy to read and interpret, it uses a standard, widely-used notation, and it's tractable: you can actually feed this into a program which will automatically generate a parser for it. Now let's look at HTML5. This exact same problem is described entirely in text, and if you read it, it's actually a computer program. It says "let input be the string being parsed", "let position be a pointer into the input, initially pointing to the start of the string", and so on. It's just a computer program turned into English. And the weird thing is that if you see a plus character, well, they'll accept it anyway: "the plus is ignored, but it's not conforming". So, weirdly, it's both conforming and non-conforming simultaneously. So the HTML5 definition of signed numbers is 16 times longer and has internal inconsistencies, and it will not surprise you to know that the HTML spec is very large. [Sounds of disbelief] I mean, you know, I didn't print it out myself; somebody else had done it and took this photograph. This is 2009; it's probably larger now, but this is good enough. So I would say that HTML5 is almost, but not quite, entirely not about markup. You can tell by reading the spec, and just looking at that little bit I just showed you, that it was written by programmers, and that's because when your only tool is a hammer, all your problems look like nails. And yet the weird thing is, they're programmers and they forgot about how to use libraries.
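[The signed-number grammar above is tractable in exactly this sense: rendered as a regular expression, another declarative notation, the pattern is the definition, and the matching engine is generated for you. This transcription is mine, not the slide's.]

```javascript
// The grammar, transcribed as a regular expression:
//   number        = optional-sign, digit+
//   optional-sign = "-"?
const number = /^-?[0-9]+$/;

console.log(number.test("-42")); // true: a signed number
console.log(number.test("7"));   // true
console.log(number.test("+7"));  // false: the grammar has no plus sign
console.log(number.test("4 2")); // false: no spaces between digits
```

Compare this one line with the HTML5 prose algorithm: the declarative form is the specification, and it parses too.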
The HTML5 spec is just one huge monolithic program. Now, declarative is not only about specifications, it's also about markup itself. HTML used to be about being declarative; I think the poster child is the a element, which encapsulates an enormous amount of behavior in a very small space. It tells you what the link looks like, what to do when you hover over it, how to activate the link in several different ways, what you should do with the result; it gives you hooks for changing the presentation; and actually doing all this with programming would be an awful lot of work. It's a single line that would be many tens of lines of code if you wanted to do it programmatically. CSS is another example of a successful declarative approach. At the time that CSS was started at W3C, Netscape declined to join the group, because they said they had a better solution, JSSS, where the J stands for JavaScript, so that instead of writing a line like that, giving this font size, you would just use a line of JavaScript: document.tags.H1.fontSize = "20pt". Wikipedia says about JSSS: "it lacks the various selector features of CSS, supporting only simple tag names, but because it's based on a complete programming language it can include highly complex dynamic calculations and conditional processing." Which is not necessarily a good thing. So this takes us to JavaScript. Here's the definitive guide, and the thing about JavaScript is that it is so good that it even has good parts. [laughter] [voice off:] That is a very thin book. [laughter and applause] Thank you for that. Indeed, the good parts are very small parts, and in fact if you're going to have to deal with JavaScript I recommend just using that book and ignoring the other.
But even the good parts have bad parts, because if you read "JavaScript: The Good Parts" it's peppered with all sorts of things like "if the operand of typeof is an array or null, then the result is 'object', which is wrong". [laughter] So apparently even some of the good parts are bad. One of the big problems with JavaScript is that debugging is really hard, and that is because of the misuse of the robustness principle, which, if you were here yesterday, you would have heard talked about. It's also known as Postel's law, and it says: "be conservative in what you do, or produce, and be liberal in what you accept from others". Now, it's a good rule in certain places, but not everyone thinks it's necessarily a good rule; I don't think it necessarily is, and I'm not the only one. These slides will be online; you can click on that link to read more about it. I think it's had a bad effect on the web, because thanks to being liberal, browsers accept all sorts of junk, but that's the method that people use to know if they've done something right: they put it in the browser, and if it looks good then they think it's right, and then they stop. So there's no way they can know what is good and what's not good; they just look at it in the browser and then they leave it as it is. So it produces "suck it and see" coding, and we get a load of junk as a result. It's bad because the message you think you are sending is not necessarily the message being received, because it's being interpreted liberally. An example of why it's not good is that it can take you hours to find out why your CSS is not working, only to discover that the browser thinks the HTML is different from what you think it is, because it's been interpreted differently. So the robustness principle was proposed in order to improve interoperability between programs: "Do you really want to deal with those fools? Better to silently fix up their mistakes and move on." But it should never be applied to programming languages!
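[JavaScript itself is liberal in exactly this way. Two small demonstrations, my own examples rather than the ones on the slides, of how accepting everything turns mistakes into silent bugs:]

```javascript
// 1. Automatic semicolon insertion: the parser "helpfully" inserts a
//    semicolon after `return`, so the object below is never returned.
function broken() {
  return
    { answer: 42 }; // parsed as an unreachable block, not a return value
}
function fixed() {
  return { answer: 42 }; // brace on the same line: nothing inserted
}

// 2. Silent type coercion: arrays and booleans are quietly converted
//    rather than rejected, so nonsense expressions still "work":
//    +[] is 0, !+[] is true, +!+[] is 1, and [1] + [0] is "1" + "0".
const ten = [+!+[]] + [+[]];

console.log(broken());       // undefined, with no warning anywhere
console.log(fixed().answer); // 42
console.log(ten);            // the string "10"
```

Neither of these is an error as far as the language is concerned: the program is accepted, runs, and quietly does the wrong thing.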
Look at this string. This is the robustness principle at work: this enormous string evaluates to the string "10", and again you can click on this to find out why, but you know, it doesn't make sense; and when you've got a bug in your program, it's very hard to get back to find out what it was that caused it. [voice off:] It actually does! It's horrifying! [laughter] It actually does, and it is actually horrifying. It's worth knowing that 90 percent of the cost of software comes from debugging, so what you need is systems that help you by reducing the need for debugging, not by producing more of it. It turns out that parsing JSON is a minefield, and this should give us a very good feeling within the XML community. This guy decided to write a test suite for JSON, and he says "I'll show you that JSON is not the easy, idealized format as many believe; indeed, I did not find two libraries that exhibit the very same behavior", and he found lots of horrible cases. It's interesting, the things he points out. You may not know it, but JSON may not start with a space, because the original spec for JSON says spaces are allowed *between* tokens. So that means you can't have an initial space. Although the spec doesn't define what a token is, so maybe it's OK anyway. [laughter] So HTML5 is based on programming rather than markup, rather than declarative, and there's a problem with that, because standardization flies out of the window. I'm using CSS presentation mode here, on a very old browser because it's not supported anymore, but it allows you to specify how any document can be formatted when doing a presentation. HTML5 took the approach that you could do this better in JavaScript, and so no browser supports presentation mode anymore. Now there are lots of JavaScript packages to do presentation, and none of them are the same; they're all different!
Which means if I use one and it dies, or I want to switch to another one, I can't without changing all my documents; the standardization has disappeared. So programmers are now doing our document design for us, and all the document formats become proprietary, in other words, because they're not standardized; so there's no interoperability, and that's the whole point of standards. So that's why there are so few new elements in HTML5: they haven't done any design, and instead said "if you need anything you can always just do it in JavaScript", and that's exactly what people have done, and they're all different. So here's a guy complaining about having to use JavaScript. He says: "Well, which version, which flavor of JavaScript are you going to use? Are you going to use a transpiler? For which language? Grunt? Gulp? Bower? Yeoman? Browserify? Webpack? Babel? Common.js? And am I mixing things up? Am I confused? Talking to the community about my analysis paralysis loop, caused by the excessive amount of available tools to choose from, resulted in the community suggesting to try out, spend time, learn and investigate four more technologies that I hadn't even considered in the first place! Good job, JavaScript." And so, because JavaScript is difficult in itself, we are getting frameworks. So which framework are you going to use? Are you going to use bare-metal JavaScript, or are you going to use Angular, Dojo, Bootstrap, or one of the other 26 listed on Wikipedia? Are they compatible? No. What happens if your chosen package dies, which has happened; or is no longer supported, which has happened; or it doesn't get updated for the latest versions of the browsers; or the owners change the license terms, which recently happened with React, causing some people to say they were going to have to rewrite their whole website because they could no longer accept the new license? You have to change all your documents: there's no standardization.
This is why we need standards, not proprietary formats like frameworks. This is just an example, from some of the frameworks, of how you would get a table with an ID called 'test-table'; as you can see, they're all different, and so if you use one package you have to do it one way, and if you use another package you do it another way: you have to change everything if you want to move your website to a new framework. And then there's the wonderful left-pad disaster. NPM is a website that helps keep track of framework packages for building websites. It's very popular: it has a billion downloads per week. One guy had a package on NPM which was called kik, and there was another program in the world called kik, and the owners of that program decided they wanted to use NPM, and they were upset about the fact that there was already a package called kik on NPM; apparently NPM only has a single global namespace, which is a mistake in itself. They owned the trademark for the word kik, and so they went to this guy and said "we want you to rename your package". Actually, they didn't ask him; the lawyers sent a cease-and-desist letter. He didn't react to this well; he said "no thank you", and so kik the company went to npm the company and said "this guy's using our trademark, we want you to take his program down". And they did! Which upset this guy greatly. So he said "screw you lot, I'm going", and he removed all 250+ packages that he had on NPM and took them elsewhere. Which broke the web, because it turns out he had an 11-line package called 'left-pad' that was used by hundreds of packages all over the web; it was part of the dependencies of an enormous number of frameworks, and so the web broke, and nobody knew why, because it was right at the bottom of this enormous chain of dependencies, and all these websites suddenly weren't working. But that's not all!
Because, having unpublished all these 250 modules, all these names were suddenly available for use, while still being built into frameworks all over the web, so anyone could have gone in there and replaced one with something malicious, and people would not necessarily have known. So: brittleness. There's an interesting thing about JavaScript: not everybody actually has it. The Government Digital Service in the UK ran an experiment and discovered that 1.1% of their 'customers' did not have JavaScript or couldn't accept JavaScript: one in every 93 users. For Amazon that would be 1.75 million people a month, and that's a large number. So even JavaScript is not a good solution, for that reason. And then we have bloat. If you want to look at one single tweet of 140 characters on Twitter, you have to download a megabyte page. It's 5200 lines of HTML before you even get to the 5 JavaScript packages, and in fact the whole of James Joyce's Ulysses is only half as long again. [laughter] Again, I recommend you read this link here, "The website obesity crisis". I'll just read one section to you: "An article from 2012 titled 'The growing epidemic of page bloat' warns that the average web page is over a megabyte in size. The article itself is 1.8 megabytes in size. Two years later an article called 'The overweight web' warns that average page size is approaching 2 megabytes. That article is 3 megabytes long. If present trends continue, there is the real chance that articles warning about page bloat will exceed 5 megabytes by 2020." [laughter] And speed; this is a new one. Because of GDPR, USA Today decided to run a separate version of their site for EU users, without all the tracking scripts and ads. "The site seemed very fast," says this guy, "so I did a performance audit." It had gone from 5.2 megabytes to 500 kilobytes, and from a load time of more than 45 seconds to 3 seconds! From 124 JavaScript files to zero, and from a total of more than 500 requests to 34.
So this is what we are getting from JavaScript. And of course, accessibility. We will all one day be 80; we will all one day need to be able to read websites with our bad eyes. But as this person points out, many developers who have grown up only using frameworks have a total lack of understanding of the fundamentals of HTML, such as valid and semantic markup. This is a great concern, as semantic markup is one of the core principles of an accessible web. And I feel so sorry for Nicole Henninger, who said this: "You know, I feel like I blinked and then all of a sudden what I thought was my job (making websites) was suddenly not my job at all, because now I'm being told that I need to do this other stuff that I don't even like, and people wonder why I'm wielding a stiletto like a weapon, and screaming 'I hate JavaScript, you can't make me, no means no', and considering a second career in comedy writing." [laughter] So just to give you an example, here's a drop-down list, as an example of how you now have to do it; this is what you have to mark it up as. You have to use a button, and you have to add all sorts of junk around it, and this still doesn't do it yet, because you've got to write some JavaScript to actually cause it to drop down; it's a horrible mess. Then there's what little design they have done, and the design techniques they used. One of the worst is what they call "paving the cow paths", which frankly I think is an offensive name, but it's actually called that in the HTML5 design principles document. So, this is actually my work, and you can see here that they should have paved that: that's where people actually go. Paving the cow paths, or 'desire paths' as they're more politely called, comes from architecture: basically, when you create a new estate you don't put the paths down immediately; you watch to see where the paths emerge, and then you pave those, and you let the people who are using it do the designing for you.
The HTML5 design principles document got this wrong, because they got the definition of paving the desire paths wrong. They said: "When a practice is already widespread among authors, consider adopting it rather than forbidding it or inventing something new". And then the example they gave is: authors already use the <br/> syntax, as opposed to <br> without the slash, so there's no harm done in allowing that to be used. But that's not actually what paving the cow paths is about; it's more like noticing that huge numbers of sites have a navigation drop-down, and so supporting that natively. Cow paths are data: if you pave the cow paths, what you're doing is setting in stone (literally) the behaviors caused by the design decisions of the past. Cow paths tell you what the cows want to do, where they want to go, not how they want to get there. So if they have to take a path around the swamp to get to the meadow, then maybe it would be a better idea, if you'll excuse the phrase, to drain the swamp [laughter] or build a bridge over it, rather than paving the path that they take to get round it. The design is spotting where they want to get to and doing something about that, rather than saying "oh well, that's how they always walk, let's put the path there". So paving the cow paths is a bad design principle in this way, especially the way they applied it. It can be a good design principle, but you have to know how to apply it. Here's one example of how they did it badly: they spidered millions of pages, because they could, and on the basis of that data decided what should be excluded from HTML5, which is already against their own design principles, because they said 'don't exclude things'. So for instance the 'rev' attribute on link they decided to remove, because not enough people were using it. But that's not paving the cow paths! That's putting a fence across a cow path because not enough cows are walking that way. So it means that people who were using 'rev' can no longer use it. (Bastards.) Irritated by colon disease. [laughter] So for years, the wider community (that's us) had agreed to use a colon to separate a name from the identification of where that name comes from.
The colon was already a legal name character, and it was chosen, I think very wisely, to be backwards compatible, so that it would still work in old processors, while new processors could recognize it in a new way; so for instance xml:lang. But no, HTML5 had to develop a new separator: the hyphen. So now you get code like this: aria-labelledby="label". So these people, who so disdained namespaces, just went and invented them all over again! Which brings us to the "not invented here" syndrome. This is from "Universal Principles of Design", a very good book; it says there are four social dynamics that underlie not invented here:
* the belief that internal capabilities are superior to external ones (in other words, they think they're smarter than us);
* fear of losing control;
* desire for credit and status;
* and the significant emotional and financial investments in internal initiatives.
Here's somebody else, CSS Squirrel; it's a comic on the web, and for each comic the author produces a blog post describing what he was saying in it. He says: "The amount of 'not invented here' mentality that pervades the modern HTML5 spec is odious. Accessibility in HTML5 isn't being decided by experts. Process, when challenged through W3C guidelines, is defended as being 'not like the old ways', in essence slapping the W3C in the face. He's made it clear he won't play by the rules. When well-meaning experts carefully announce their opposing positions and desire for some form of closing the gaps, Ian and the inner circle constantly express how they don't understand." So many groups had already solved lots of problems which HTML5 could have used, but no, they went on to reinvent things that already existed. RDFa being a good example. We had this question: "how are you going to represent general metadata in HTML?"
We created a cross-working-group task force of the interested parties in 2004; we had the first working draft in 2008; we had a recommendation for RDFa. So it represented more than 5 years of work and consensus and agreement, and then in 2009 HTML5 comes along with something they called microdata. It was actually copied from RDFa, but then different; it was less capable, it couldn't do as much; and it came with no warning, no discussion, no consultation. In other words, FUD ensued, because people didn't know "should we use microdata or should we use RDFa?". And then in 2013 microdata got abandoned. Forward compatibility. XML did one great thing and introduced this notation for empty elements, because it means you don't need a DTD or a schema to know how to parse a document: you know whether an element is going to be empty or not. But they dropped that, I think because of 'irritated by colon disease', but that means they can now never add a new empty element to the language without breaking something, because there's no way to spot them; empty elements are sort of magic things built into HTML5 that the parser just has to know about. XHTML required quotes around all attribute values, which HTML didn't. HTML5 went back to the old way, but here's somebody saying, well, actually you'd better just put quotes around them anyway, because it's confusing to remember under what conditions you can and can't omit the quotes, and it's also dangerous, because there are XSS security vulnerabilities if you don't. And "show source" is of course now absolutely ghastly. This is just one page chosen completely at random, but I mean, you can't look at documents any more, because it's just ghastly. So HTML5 actually isn't about markup at all; it's actually only about the DOM. Unfortunately the old DOM group got closed long ago; it shouldn't have been, and then the HTML5 people could have played their game there.
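[The warning above about unquoted attribute values can be made concrete with a sketch; this is my own example, not one from the talk. User input containing a space escapes an unquoted attribute and becomes a new, live attribute of its own:]

```javascript
// User-controlled input containing a space and an event handler.
const userInput = 'x onmouseover=alert(1)';

// Unquoted: the space ends the src value, and onmouseover becomes a
// separate attribute -- script injection without any <script> tag.
const unquoted = `<img src=${userInput}>`;

// Quoted: the whole input stays inside one attribute value.
const quoted = `<img src="${userInput}">`;

console.log(unquoted); // <img src=x onmouseover=alert(1)>
console.log(quoted);   // <img src="x onmouseover=alert(1)">
```

Quoting alone is not a complete defense (input containing quote characters still needs escaping), but unquoted values fail even on a single space.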
So I claim that the web has been ruined by putting it in the hands of people who don't know how to design. But now, in 2018, the last XML group at W3C has closed; today is Liam's last day at W3C. Thank you, Liam, for all your great work. [Enthusiastic applause] W3C has abandoned the declarative web. What does the web need? It needs modularity, it needs extensibility, it needs accessibility, it needs a declarative approach. And who can offer that? XML. We need a 100 year web: we need to be thinking about 100 years from now, because the web is the way that we distribute information. We want to be able to read the documents that we're creating now in 100 years' time. We can read books that are 100 years old; it's ridiculous to expect that people in 100 years' time will have to be using a 100 year old implementation of JavaScript just in order to read our documents. At least declarative markup is easier to keep alive, because it's independent of the implementation! So my talk today is a call to action. We need a new movement to lead the web to its full potential. We need to seize back the declarative web. As XForms and other markups have shown, we can still create meaningful declarative documents and serve them to HTML browsers. HTML is now in effect the assembly language of the web, and we can go back to having a coherent, declarative, author-friendly web. But how are we going to do it? We need a new organization. If W3C doesn't want to have us, we need to go somewhere else, and I think the thing to do is to create an organization where we can do this work. We need a new home for our work, where the people who are there understand what we're doing and offer moral support, so that we don't have the fights that we've had in the past with people who didn't understand what we were trying to do. We need a place where we can continue to develop our specifications, and we need a home for declarative. What can we do?
Well, we still have XProc being developed: that needs a home. XForms is being developed: that needs a home. Invisible XML got mentioned; I think that would be an excellent thing to use. And if we're going to do it via HTML as an assembly language, then we need to create standards so that namespaces can collaborate, so that you can have a plug-in bit of JavaScript for one namespace and another for another namespace, and they play together nicely. Socially, we can do conferences and meetings and volunteers. What do we need? We need a name and a brand, and I'm glad that I'm giving this talk at the beginning of the week, because we can talk about this. I'm not sure if we actually need a physical location, but we can talk about that too. We need to think about infrastructure: domain, website, email. I heard people talking about how to do funding yesterday, which is very good, because we'll need funding; and there'll be legalities, and process, and membership, and steering committees, and elections. We want a hundred-year web. All this, for a new organization, can't happen overnight, and I can't do it alone either. But it does really need to happen. We need to seize the initiative and recreate a declarative, robust, lasting web. We can let HTML5 play; it can be our assembly language; it's time for some higher-level languages. So. Who's in? [Applause]