Data structures are the bones of every website and an integral part of HTML coding, since tags are used to assign various settings and features to text segments. Among other things, tags allow web developers to define paragraphs, titles, lists, hyperlinks, graphics, tables, and videos, and to set text in bold or italics. Programmes that read the code receive detailed information on the structure of an HTML document and on how the tagged elements are to be displayed. What the tagged content means, however, isn’t captured when the code is read automatically. In the example below, taken from a news article, the left depiction shows which information a programme registers, while the right one shows how a human reader would interpret the text:
While human readers can infer that the headline is a title, that the subheadline gives the author’s name, and so on, programmes can only interpret information that has been labelled (or tagged) in the HTML code: headline (<h1>), subheadline (<h2>), italics (<i>). This matters when search engine crawlers are at play, since they are responsible for determining a website’s relevance to search queries. This is why many website owners enrich their HTML documents with machine-readable semantic information that defines the meaning of individual pieces of content. This is known as structured data.
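To make this gap concrete, the following Python sketch mimics a very naive reading programme. The article fragment, the `TagLogger` class, and the sample texts are assumptions made up for illustration, and real crawlers are far more sophisticated. The point is only that such a programme registers tag names and text, but nothing that says which text is the author's name.

```python
from html.parser import HTMLParser

# Hypothetical news article fragment: the tags describe display, not meaning.
ARTICLE = (
    "<h1>City opens new library</h1>"
    "<h2>Jane Doe</h2>"
    "<p><i>Published on Monday</i></p>"
)


class TagLogger(HTMLParser):
    """Records (innermost enclosing tag, text) pairs for each piece of text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags
        self.seen = []   # what the programme "registers"

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip() and self.stack:
            self.seen.append((self.stack[-1], data.strip()))


logger = TagLogger()
logger.feed(ARTICLE)
seen = logger.seen

for tag, text in seen:
    print(f"<{tag}>: {text}")
```

Nothing in the recorded pairs marks "Jane Doe" as a person or an author; that is the information structured data adds.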
Why is structured data needed?
The idea of structuring website data so that programmes can process information expressed in human language comes from the concept of the semantic web. When properly used, structured data makes website content machine readable. This is particularly relevant for text-based search engines like Google, Bing, or Yahoo!. When content carries the corresponding tags, these Big Data giants are able to read and evaluate semantic information and process it into various display forms, such as the Knowledge Graph or Rich Snippets in the SERP (search engine results page). The latter is especially important for website owners.
Rich Snippets are excerpts from web content that supplement the basic information (URL, title, and description) shown in the SERPs. For this additional information to be displayed, all relevant content needs to be tagged in the HTML code and assigned a certain information type by the website owner. Currently, the market leader, Google, processes structured data in order to display Rich Snippets for the following data types:
- Product information: price, availability, reviews and user experiences
- Recipes: pictures, preparation time, calories, and reviews
- User experiences: restaurants, movies, stores and businesses
- Events: musicals, concerts, exhibitions, or festivals, including duration
- Software: reviews, price, user experiences
- Videos: description and image preview
- News articles: title, publication date, author details, and picture
For website owners, Rich Snippets have the advantage of taking up significantly more space in the SERPs and standing out more, which leads to a higher click rate. Search result displays can also be expanded with breadcrumbs (a graphical navigation element) and the sitelinks search box.
Google displays the sitelinks search box for navigational search requests, that is, when the desired website can be derived from the user’s search query but the specific subpage can’t; this usually occurs when users search for brands. The box lets users search within a website directly from the SERPs, without first having to open the site itself. For site owners, sitelinks and search boxes again have the advantage of attracting more attention because of the comparatively large amount of space they occupy in the SERPs.
Breadcrumbs display the position of a search hit within the structure of a website and help search engine users orientate themselves.
Exactly which search results are expanded with these features depends on the criteria each search engine uses to determine relevance. This is why it’s important to tag your website accordingly: search engines need structured data in order to generate Rich Snippets, breadcrumbs, or a sitelinks search box.
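As an illustration of what such tagging can look like for breadcrumbs, here is a minimal Python sketch that builds a breadcrumb description using the schema.org types `BreadcrumbList` and `ListItem`. The helper name `breadcrumb_jsonld`, the page names, and the URLs are assumptions for this example, and whether a search engine actually displays the breadcrumbs remains its own decision.

```python
import json


def breadcrumb_jsonld(trail):
    """Build a schema.org BreadcrumbList from (name, url) pairs.

    `trail` is ordered from the home page down to the current page.
    """
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": position, "name": name, "item": url}
            for position, (name, url) in enumerate(trail, start=1)
        ],
    }


# Hypothetical site structure used only for illustration.
trail = [
    ("Home", "http://website.com/"),
    ("Guides", "http://website.com/guides/"),
    ("Structured data", "http://website.com/guides/structured-data/"),
]

markup = breadcrumb_jsonld(trail)

# The resulting JSON is placed in a script element in the page's HTML.
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(markup, indent=2)
    + "</script>"
)
print(script_tag)
```

Generating the markup from the same data that drives the site navigation keeps the visible breadcrumbs and the structured data from drifting apart.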
Structuring data on your own website
There are several standard formats that site owners follow in order to ensure that content with structured data is machine readable. These include microformats, RDFa, and microdata. All three formats for data structuring are based on semantic tagging, which is entered directly into the HTML code. Depending on the format, either traditional HTML attributes or new labelling elements can be used. The data format JSON-LD has become increasingly popular over the past few years; this option makes it possible to annotate a web page within a script.
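The following Python sketch shows how a JSON-LD annotation sits in a script element, separate from the visible markup, and how a programme might read it. The page content, the `JsonLdExtractor` class, and the choice of the schema.org type `Person` are assumptions for illustration, not a description of how any particular search engine works.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the annotation lives in a script element, apart from the visible text.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Person",
 "name": "first name last name", "telephone": "phone number",
 "url": "http://website.com/"}
</script>
</head><body><div>first name last name</div></body></html>"""


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every JSON-LD script element."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.items.append(json.loads("".join(self.buffer)))
            self.in_jsonld = False


extractor = JsonLdExtractor()
extractor.feed(PAGE)
items = extractor.items
print(items[0]["@type"], "-", items[0]["name"])
```

Because the annotation is plain JSON, it can be parsed without inspecting the surrounding HTML elements at all.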
Microformats are a labelling format for semantically tagging HTML and XHTML documents. They reuse well-known HTML attributes such as class, rel, and rev, whose values programmes like web crawlers extract from the page code to read out semantic information. A typical use case is labelling contact information with the microformat hCard, which is integrated in the HTML code as class="vcard":
An example of common labelling for contact information in HTML:
<div>first name last name</div>
Tagging contact information with the microformat hCard
<div class="vcard">
<div class="fn">first name last name</div>
<div class="tel">phone number</div>
<a class="url" href="http://website.com/">http://website.com/</a>
</div>
While the contact information in pure HTML markup is tagged only as div elements, integrating the microformat hCard via the HTML attribute class="vcard" allows specific pieces of information, such as names, organisations, or telephone numbers, to be given distinct semantic annotations. The advantage of this type of labelling is that it relies on familiar HTML attributes and is easy to apply. The drawback is that semantic annotation with microformats is limited to a few predefined elements. Using class attributes can also lead to conflicts with CSS, and microformats don’t provide an API for extracting data.
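How a programme could read these class-based annotations can be sketched in a few lines of Python. The `HCardParser` class and the small set of recognised property names are assumptions for this illustration; it is a deliberately simplified reader, not a complete microformats parser.

```python
from html.parser import HTMLParser

# Contact markup tagged with hCard class names, wrapped in class="vcard".
HCARD = """<div class="vcard">
<div class="fn">first name last name</div>
<div class="tel">phone number</div>
<a class="url" href="http://website.com/">http://website.com/</a>
</div>"""

# Only the predefined property names used in this example are recognised.
HCARD_PROPERTIES = {"fn", "tel", "url"}


class HCardParser(HTMLParser):
    """Maps recognised hCard class names to the text of their element."""

    def __init__(self):
        super().__init__()
        self.current = None  # property name of the element being read
        self.card = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matches = [name for name in classes if name in HCARD_PROPERTIES]
        self.current = matches[0] if matches else None

    def handle_data(self, data):
        # Simplification: the element text is used as the value, even for links.
        if self.current and data.strip():
            self.card[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None


parser = HCardParser()
parser.feed(HCARD)
card = parser.card
print(card)
```

The fixed `HCARD_PROPERTIES` set mirrors the limitation described above: only predefined names carry meaning, and any other class value is ignored.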
RDFa stands for ‘resource description framework in attributes’. The W3C recommends this format for embedding RDF statements in HTML, XHTML, and other XML dialects. Instead of having to rely on common HTML attributes, RDFa introduces new attributes that enable complex semantic annotation. The following example shows contact information as structured data in RDFa format:
Tagging contact information with RDFa
<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Person">
<div property="v:name">first name last name</div>
<div property="v:tel">phone number</div>
<a href="http://website.com" rel="v:url">www.website.com</a>
</div>
Before data can be tagged in the RDFa format, the corresponding XML namespace has to be defined. The attribute typeof specifies the data type of the subject of an RDF statement. The attribute property determines the predicate of a statement and thereby specifies which characteristic an element’s content describes. The advantages of structuring data with RDFa include its high flexibility and the option of defining a custom vocabulary. Prefixes also help keep the code compact. RDFa supports a DOM API (document object model application programming interface) that extracts a website’s structured data and can also be used for interactive applications. A disadvantage is its focus on XML and XHTML, even though RDFa can also be embedded in HTML5. A detailed guide on schema.org can be found in our tutorial on the topic. For the standardised vocabulary of RDFa annotations, consult the official website.
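The role of the namespace prefix can be illustrated with a short Python sketch that expands prefixed names such as v:name into full identifiers. The `RdfaSketch` class is an assumption for illustration and covers only the attributes used in the example above; it is not a conforming RDFa processor.

```python
from html.parser import HTMLParser

RDFA = """<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Person">
<div property="v:name">first name last name</div>
<div property="v:tel">phone number</div>
<a href="http://website.com" rel="v:url">www.website.com</a>
</div>"""


class RdfaSketch(HTMLParser):
    """Collects (predicate, value) pairs and expands prefixes to full names."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}     # prefix -> namespace, from xmlns:* attributes
        self.item_type = None  # expanded value of typeof
        self.current = None    # property of the element being read
        self.statements = []

    def expand(self, name):
        prefix, _, local = name.partition(":")
        return self.prefixes.get(prefix, prefix + ":") + local

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for key, value in attrs.items():
            if key.startswith("xmlns:"):
                self.prefixes[key[len("xmlns:"):]] = value
        if "typeof" in attrs:
            self.item_type = self.expand(attrs["typeof"])
        if "rel" in attrs and "href" in attrs:
            # For links, the link target is the value of the statement.
            self.statements.append((self.expand(attrs["rel"]), attrs["href"]))
            self.current = None
        else:
            self.current = attrs.get("property")

    def handle_data(self, data):
        if self.current and data.strip():
            self.statements.append((self.expand(self.current), data.strip()))

    def handle_endtag(self, tag):
        self.current = None


reader = RdfaSketch()
reader.feed(RDFA)
item_type = reader.item_type
statements = reader.statements
print(item_type)
for predicate, value in statements:
    print(predicate, "=", value)
```

The short prefix v keeps the markup compact, while the expanded names identify each property unambiguously within its vocabulary.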
Microdata is a separately defined HTML5 module that adds attributes to existing markup; these attributes are used for semantic annotation. As with microformats and RDFa, the format uses simple attributes in HTML tags to assign properties to items. The microdata syntax is based on a vocabulary that allows items to be described as name/value pairs. This makes the markup format a compromise between moderate complexity, flexibility, and extensibility. Microdata supports a native JSON export for transferring and saving structured data, as well as a Microdata DOM API. Microdata is compatible with the schema.org vocabulary.
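For comparison with the hCard and RDFa examples, the same contact information could be tagged with the microdata attributes itemscope, itemtype, and itemprop. The markup below, the use of the schema.org type Person, and the `MicrodataSketch` class are assumptions for illustration; the Python code is a simplified reader that turns the name/value pairs into JSON, not the native export mentioned above.

```python
import json
from html.parser import HTMLParser

# Hypothetical microdata version of the contact information example.
MICRODATA = """<div itemscope itemtype="https://schema.org/Person">
<div itemprop="name">first name last name</div>
<div itemprop="telephone">phone number</div>
<a itemprop="url" href="http://website.com/">http://website.com/</a>
</div>"""


class MicrodataSketch(HTMLParser):
    """Reads one item and its properties as name/value pairs."""

    def __init__(self):
        super().__init__()
        self.item = {"type": None, "properties": {}}
        self.current = None  # itemprop of the element being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item["type"] = attrs.get("itemtype")
        prop = attrs.get("itemprop")
        if prop and tag == "a":
            # For links, the value is the link target rather than the text.
            self.item["properties"][prop] = attrs.get("href")
            self.current = None
        else:
            self.current = prop

    def handle_data(self, data):
        if self.current and data.strip():
            self.item["properties"][self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None


reader = MicrodataSketch()
reader.feed(MICRODATA)
item = reader.item
exported = json.dumps(item, indent=2)
print(exported)
```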
The project Schema.org
Initiated by market leaders Google, Bing, Yahoo! and Yandex, the collaborative community Schema.org sets out to standardise the semantic annotation of website content. Browsing through the website, users will find a uniform set of schemes for structured data. Schema.org supports the data formats RDFa, Microdata, and JSON-LD.
Tip: testing structured data with Google
Labelling HTML documents through semantic annotation requires a high level of care. Mistakes are best avoided by extending a page’s source code step by step and validating the tags as you go along. For this, Google provides a free structured data testing tool. Here, site owners can check individual code excerpts or enter the URL of a web page to check its source code for errors. The search engine giant also offers a tool, Data Highlighter, which lets users tag data directly on a web page in the browser. Relevant areas are marked with the mouse and then assigned a keyword. This method of semantic annotation doesn’t add any labelling to the source code itself. The tagged areas can only be read by Google, which can use them for additional display forms. Other search engines like Bing or Yahoo! have no access to this information.