CENTER FOR
SCALABLE DATA ANALYTICS
AND ARTIFICIAL INTELLIGENCE

Static Publications Site-Tutorial (ORC-Schlange) - BibTeX in Python

BibTeX in Python

BibTeX is a format for storing lists of references. It is interesting for us because ORCID stores the information about each work as a BibTeX-formatted string. We need to parse this string and convert it into a useful representation.

However, we do not need to reinvent the wheel here. A BibTeX parser already exists: Pybtex, a program written in Python that also provides a Python API we can use.

First we need to install Pybtex on the system with pip:

pip install pybtex

If this does not work, have a look at the Git repository for more installation instructions.

An overview of the Python library can be found in the Pybtex documentation.

To read the bibliography from a string, a simple function exists:

pybtex.database.parse_string(value, bib_format, **kwargs)

The return value is a BibliographyData object. The plan is to parse every work and then combine these objects into one BibliographyData object. This is achieved by creating an empty BibliographyData object and then adding the entries of all parsed BibliographyData objects to it. The entries are accessed through the attribute entries, which holds a dict of all entries. A simple helper function does this.

The resulting BibliographyData object can be written to a BibTeX file using its to_file function. Writing it to a file is also a simple way to view the content of such a BibliographyData object.

The complete code looks as follows:

from pybtex.database import BibliographyData, parse_string
def joinBibliography(bib1, bib2):
	for key in bib2.entries:
		bib1.entries[key] = bib2.entries[key]

if __name__ == "__main__":
	db = DB()
	orcs = [OrcID(*t) for t in db.getList()]
	db.close()
	alldocs = []
	api = API()
	for orc in orcs:
		alldocs += [d for d in api.getWorks(orc) if orc.start <= d.date <= orc.stop]
	alldocs.sort()
	uniqdocs = [doc for doc,_ in itertools.groupby(alldocs)]
	bib = BibliographyData()
	for d in uniqdocs:
		joinBibliography (bib,parse_string(api.getWork(d),"bibtex"))
	bib.to_file("out.bib")

In line 1, the BibliographyData class and the parsing function are imported from pybtex. The helper function (lines 2-4) simply adds all entries from bib2 to the entries of bib1. Lines 6 to 15 are the usual main function; the names DB, OrcID, API and itertools are assumed to be defined or imported as in the earlier parts of the tutorial. In line 16, the new empty BibliographyData object (bib) is created, which collects all data. In line 18, the API function getWork is used to get the BibTeX string of the entry. This string and the format name "bibtex" are passed as arguments to the parsing function. The result is added to bib with the helper function. In the last line, the result is written to a file named "out.bib".
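The main function removes duplicate works by sorting the list and then keeping one item per group. A small stand-alone sketch of this step, using made-up (year, identifier) tuples instead of the real work objects:

```python
import itertools

# Hypothetical (year, identifier) tuples standing in for the work summaries.
works = [(2015, "b"), (1998, "a"), (2015, "b"), (2013, "c"), (1998, "a")]

# groupby only merges neighbours, so equal items must be adjacent first.
works.sort()
unique_works = [work for work, _ in itertools.groupby(works)]

print(unique_works)  # [(1998, 'a'), (2013, 'c'), (2015, 'b')]
```

Without the sort, groupby would only collapse duplicates that happen to be next to each other.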

The content of out.bib should look like this and it can be downloaded here:

@article{micceri1989unicorn,
    author = "Micceri, Theodore",
    title = "The unicorn, the normal curve, and other improbable creatures.",
    journal = "Psychological bulletin",
    volume = "105",
    number = "1",
    pages = "156",
    year = "1989",
    publisher = "American Psychological Association"
}

@article{barthes199411,
    author = "Barthes, Roland",
    title = "11 The Death of the Author",
    journal = "Media Texts, Authors and Readers: A Reader",
    pages = "166",
    year = "1994",
    publisher = "Multilingual Matters"
}

@article{collins1998s,
    author = "Collins, James J and Chow, Carson C",
    title = "It's a small world",
    journal = "Nature",
    volume = "393",
    number = "6684",
    pages = "409--410",
    year = "1998",
    publisher = "Nature Publishing Group"
}

@article{curtiss2013unicorn,
    author = "Curtiss, Michael and Becker, Iain and Bosman, Tudor and Doroshenko, Sergey and Grijincu, Lucian and Jackson, Tom and Kunnatur, Sandhya and Lassen, Soren and Pronin, Philip and Sankar, Sriram and others",
    title = "Unicorn: A system for searching the social graph",
    journal = "Proceedings of the VLDB Endowment",
    volume = "6",
    number = "11",
    pages = "1150--1161",
    year = "2013",
    publisher = "VLDB Endowment",
    doi = "10.14778/2536222.2536239"
}

@article{aad2015combined,
    author = "Aad, Georges and Abbott, B and Abdallah, J and Abdinov, O and Aben, R and Abolins, M and AbouZeid, OS and Abramowicz, H and Abreu, H and Abreu, R and others",
    title = "Combined Measurement of the Higgs Boson Mass in p p Collisions at s= 7 and 8 TeV with the ATLAS and CMS Experiments",
    journal = "Physical review letters",
    volume = "114",
    number = "19",
    pages = "191803",
    year = "2015",
    publisher = "APS"
}

@article{cheng2015generalized,
    author = "Cheng, Xinyue and Zou, Yangyang",
    title = "The generalized unicorn problem in Finsler geometry",
    journal = "Differential Geometry-Dynamical Systems",
    volume = "17",
    pages = "38--48",
    year = "2015"
}

Creating such a bib file from the ORCID iDs is already a useful application. BibTeX is a standard format that can be used in many cases. However, in our case we want to go a step further and create a pretty website out of the data.

To our advantage, Pybtex already has a system for writing HTML files based on a BibliographyData object. The way this is done is simple: first, a Style is created to format the data, and then a Backend is used to write the data in the right format. An HTML backend that we can use already exists. As a simple style we can use the standard "unsrt" bibliography style.

Using these two, the main function looks as follows:

from pybtex.style.formatting.unsrt import Style
from pybtex.backends.html import Backend
if __name__ == "__main__":
	db = DB()
	orcs = [OrcID(*t) for t in db.getList()]
	db.close()
	alldocs = []
	api = API()
	for orc in orcs:
		alldocs += [d for d in api.getWorks(orc) if orc.start <= d.date <= orc.stop]
	alldocs.sort()
	uniqdocs = [doc for doc,_ in itertools.groupby(alldocs)]
	bib = BibliographyData()
	for d in uniqdocs:
		joinBibliography (bib,parse_string(api.getWork(d),"bibtex"))
	style = Style()
	formatbib = style.format_bibliography(bib)
	back = Backend()
	back.write_to_file(formatbib,"out.html")

In line 1, the unsrt Style is imported and in line 2, the html Backend. In line 16, the new Style object is created and in line 17, it is used to create a formatted bibliography. In line 18, the Backend object is created and in line 19, it is used to write the formatted bibliography to a file called "out.html".

Here is the resulting HTML page (download here):

This does not look pretty, so we start to tweak the result. First, we look at the Style.

The idea of the style is that a Rich text object is created for every entry. These objects are then rendered by the backends.

Rich text consists of six classes:

  • Text
  • String
  • Tag
  • HRef
  • Protected
  • Symbol

Symbol and String are the atoms: a Symbol represents one special symbol, like a line break, and a String holds plain text. The remaining classes are containers that can contain all Rich text classes. Protected marks content that is not affected by case-changing operations. An HRef creates a link to something, and a Tag has a name and creates a tag with this name. Text is a container with no special feature.

These classes offer great possibilities to define how an entry should look. However, one thing is missing for our HTML rendering: in HTML, tags can have attributes such as CSS classes or inline CSS. To solve this, we create our own HtmlTag that inherits from the normal Tag:

class HtmlTag(Tag):
	def __init__(self, name, opt, *args):
		super(HtmlTag,self).__init__(name, *args)
		self.options = opt
	def render(self, backend):
		text = super(Tag, self).render(backend)
		try:
			return backend.format_tag(self.name, text, self.options)
		except TypeError:
			return backend.format_tag(self.name, text)

In line 1, the new class is defined with the superclass Tag. In line 2, the initialization starts. It takes a name and *args like a normal Tag, but also an opt argument for the options. In line 3, the initialization of Tag is called with name and *args using the super function. After this, the extra opt argument is saved in self.options (line 4).

Every Rich text object has a render function that is called with the backend to produce the right representation. We need to overwrite it as well, so that the options reach the backend (from line 5). In line 6, the super function is used to render the text. This is necessary because a Tag is a container, so all nested Rich text objects must be rendered first. Note that super(Tag, self) deliberately skips the render function of Tag itself and calls the one of its parent class. Then the rendered text is sent to the backend and the result is returned (line 8). The backend gets the name and the text like for a normal Tag, but also the options. However, not every backend supports such a call. The call can raise a TypeError if the backend's format_tag accepts only two arguments. This should not break the rendering, so exception handling is used to fall back to the normal Tag rendering in such cases: the statement is placed in a try block (line 7), followed by an except block that catches the TypeError (line 9) and returns the rendering without the options (line 10).

With this HtmlTag we can now create our own style that creates prettier output. A Style has many functions like:
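The fallback can be illustrated without Pybtex. In this sketch both backend classes and the render_tag helper are hypothetical stand-ins; only the try/except pattern is the one used in HtmlTag.

```python
class PlainBackend:
    """Hypothetical backend whose format_tag takes no options."""

    def format_tag(self, name, text):
        return "<{0}>{1}</{0}>".format(name, text)


class OptionBackend:
    """Hypothetical backend whose format_tag accepts an options string."""

    def format_tag(self, name, text, options=None):
        attributes = " " + options if options else ""
        return "<{0}{1}>{2}</{0}>".format(name, attributes, text)


def render_tag(backend, name, text, options):
    """Try the three-argument call first, fall back to two arguments."""
    try:
        return backend.format_tag(name, text, options)
    except TypeError:
        # Raised when format_tag does not accept the extra argument.
        return backend.format_tag(name, text)


print(render_tag(OptionBackend(), "h4", "Title", 'class="year"'))
print(render_tag(PlainBackend(), "h4", "Title", 'class="year"'))
```

One caveat of this pattern: a TypeError raised for another reason inside format_tag would also trigger the fallback.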

  • format_article
  • format_book
  • format_inbook
  • format_inproceedings

and more. The different functions are called for the different types of BibTeX entries. We only overwrite the format_article function, because all entries in our example are articles. However, the other types should not break the complete process, so we again inherit from an existing style and use the unsrt Style as superclass. The result should have the form:

<div>
<h4>*title*</h4>
<i>*authors*</i><br>
*journal*<br>
<a href="https://doi.org/*doi*">[ Publishers's page ]</a>
</div>

To get this result, the class looks as follows:

class HtmlStyle(Style):
	def format_article(self, context):
		ret = Text()
		ret += HtmlTag("h4","style=\"margin-bottom: 2px;\"", context.rich_fields['title'])
		ret += Tag("i",context.rich_fields['author']) + Symbol('newblock')
		ret += context.rich_fields['journal']
		if 'volume' in context.fields:
			ret += Symbol("nbsp") + context.rich_fields['volume']
		if 'number' in context.fields:
			ret += Symbol("nbsp") + "(" + context.rich_fields['number'] + ")"
		if 'pages' in context.fields:
			ret = ret + ":" + context.rich_fields['pages']
		if 'doi' in context.fields:
			ret += Symbol('newblock') + HRef('https://doi.org/' + context.fields['doi'],"[ Publishers's page ]")
		return HtmlTag("div","class=\"" + context.fields['year'] +  " mix \"",ret)

The format_article function gets a context as input. Its variable fields contains the information of the corresponding entry as plain strings (for example line 7). The same information is also available as Rich text in the variable rich_fields (for example line 4). Where strings are needed, the fields variable is used, and where Rich text is needed, the rich_fields variable is used.

In line 3, the return container is initialized as an empty Text(). After this, new content is added at the end of this container. In line 4, the title is added as an h4. Here, the HtmlTag is used directly; it gets a style option that changes the bottom margin. The authors are added in line 5. They are wrapped in an i tag before they are added to the text. After this, a newblock Symbol is added, which stands for a line break. In line 6, the journal title is simply added as Text. After this, some optional journal information is given (volume, number and pages); it is added to the Text only if it exists. In line 8, the volume is added. In front of it another Symbol is placed: the nbsp Symbol stands for a non-breaking space. In line 10, the number is added. Here, "(" and ")" are added as normal strings; the add operation of the Rich text automatically converts them to Rich text. In line 12, the pages are added. Here, "ret = ret +" is used instead of "ret +=". At first glance these look equivalent, but the evaluation order is not the same. With "ret = ret +", ret + ":" is evaluated first, which means that the add function of the Rich text is called. With "ret +=", ":" + context.rich_fields['pages'] is evaluated first, which triggers the add function of the standard string and raises an error. In line 14, the DOI is added as an HRef, where the link is given as a standard string. In the last line, the enclosing div is created as an HtmlTag. Here, the options are classes: the year as a number and "mix". The latter is used later in the tutorial.

The new main looks as follows:
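The difference between the two spellings can be reproduced with a toy class. The class Rich below is a made-up stand-in that, like the behaviour described above, supports "Rich + str" but not "str + Rich"; it is not the Pybtex implementation.

```python
class Rich:
    """Toy stand-in for a rich text object: has __add__ but no __radd__."""

    def __init__(self, *parts):
        self.parts = list(parts)

    def __add__(self, other):
        extra = other.parts if isinstance(other, Rich) else [other]
        return Rich(*(self.parts + extra))


pages = Rich("409--410")

# Works: evaluated as (ret + ":") + pages, so Rich.__add__ runs both times.
ret = Rich("Nature")
ret = ret + ":" + pages
print(ret.parts)  # ['Nature', ':', '409--410']

# Fails: ":" + pages is evaluated first, and str cannot add a Rich object.
other = Rich("Nature")
try:
    other += ":" + pages
    failed = False
except TypeError as error:
    failed = True
    print("TypeError:", error)
```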

if __name__ == "__main__":
	db = DB()
	orcs = [OrcID(*t) for t in db.getList()]
	db.close()
	alldocs = []
	api = API()
	for orc in orcs:
		alldocs += [d for d in api.getWorks(orc) if orc.start <= d.date <= orc.stop]
	alldocs.sort()
	uniqdocs = [doc for doc,_ in itertools.groupby(alldocs)]
	bib = BibliographyData()
	for d in uniqdocs:
		joinBibliography (bib,parse_string(api.getWork(d),"bibtex"))
	style = HtmlStyle()
	style.sort = lambda x: sorted(x, key = lambda e:-int(e.fields['year']))
	formatbib = style.format_bibliography(bib)
	back = Backend()
	back.write_to_file(formatbib,"out.html")

The only differences from the previous main function are in lines 14 and 15. In line 14, the new HtmlStyle is used instead of Style. In line 15, the sort function, which sorts all entries before they are rendered, is overwritten. The entries are sorted by the negative integer value of the year field, so that the sort order is reversed and the newest entries come first.

The result can be downloaded here and now looks much prettier:

However, some parts do not meet our expectations: the line breaks, the formatting of the title and the numbers in front of every entry. All three have their source in the Backend, so we can get rid of them by implementing our own HtmlBackend. Of course, we only want to change the things that are not to our liking, so the class will inherit from the normal html Backend.

The four things that we change are:
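The effect of the negated key can be checked on plain data. In this sketch the entries are reduced to made-up dicts with only a key and a year:

```python
# Hypothetical entries reduced to the one field the sort key looks at.
entries = [
    {"key": "micceri1989unicorn", "year": "1989"},
    {"key": "aad2015combined", "year": "2015"},
    {"key": "curtiss2013unicorn", "year": "2013"},
    {"key": "cheng2015generalized", "year": "2015"},
]

# Negating the integer year turns the ascending sort into newest-first.
newest_first = sorted(entries, key=lambda e: -int(e["year"]))
print([e["key"] for e in newest_first])

# The same order, written with reverse=True instead of a negated key.
same_order = sorted(entries, key=lambda e: int(e["year"]), reverse=True)
print(newest_first == same_order)  # True
```

Both variants expect every entry to have a numeric year; an entry without one would raise an error and would need separate handling.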

  1. The interpretation of Symbols
  2. The use of the HtmlTag
  3. How an entry is written
  4. The enclosing html

The last one is not necessary but makes things simpler in the end.

The resulting class looks as follows:

class HtmlBackend(Backend):
	symbols = {'ndash': u'&ndash;', 'newblock': u'<br/>\n', 'nbsp': u'&nbsp;'}
	format_tag = lambda self, tag, text, options =None: u'<{0} {2} >{1}</{0}>'.format(tag, text, options if options else "") if text else u''
	label = None
	def write_entry(self, key, label, text):
		if label != self.label:
			self.output(u'<h3 class=\"{0} year\">{0}</h3>\n'.format(label))
			self.label = label
		self.output(u'%s\n' % text)
	write_epilogue = lambda self: self.output(u'</div></body></html>\n')
	prologue = u"""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
		<html>
		<head><meta name="generator" content="Pybtex">
		<meta http-equiv="Content-Type" content="text/html; charset=%s">
		<title>Bibliography</title>
		{HEAD}
		</head>
		<body>
		{BODY}
		<div id="content">
		"""
	def prepout (self, head, body):
		self.prologue = self.prologue.format(HEAD = head, BODY = body)
	def write_prologue(self):
		try:
			self.prepout("","")
		except ValueError:
			pass
		self.output(self.prologue % (self.encoding or pybtex.io.get_default_encoding()))

In line 2, the class variable symbols is set. The dict has an entry for every Symbol that assigns the corresponding rendering to it. The newblock entry is assigned to <br/> so that the line breaks work.

The next feature makes the HtmlTag really work. The normal Backend does not render the options field. This is done in line 3, where the format_tag function is overwritten with an optional argument options. If it is not given, a conditional expression inserts an empty string in its place during string formatting.

In lines 4 to 9, write_entry is overwritten. This is the function that is called for every entry. Here we can get rid of the numbers that are rendered before the entries. The function takes three arguments: a citation key, a label and a text. We do not use the citation key anywhere. The label is normally the number of the entry. However, we will change this in the main function so that it is the year of the entry. The last argument is the entry as rendered text. We want the year to be printed as an <h3> every time a new year is reached. So we must store in self what the last year was. This is done in self.label. In line 4, it is set to None because no year has been rendered yet. In line 6, it is checked whether a new label (year) is reached with this entry. If so, an <h3> is output in line 7 and the label is saved as the last output label in line 8. Note that the function self.output is used to create the output. This is a function of the Backend that writes the output to the file. In line 9, the rendered text is written with self.output.

The last feature is the enclosing HTML. Here, two functions are interesting: write_prologue and write_epilogue. As the names suggest, write_prologue is called before the entries are written and write_epilogue is called after the entries are written. The latter, in line 10, is straightforward: it closes the enclosing div, the body and the complete html element. It is a simple lambda function. The more complex case is the write_prologue function, because the complete head of the HTML file is written here. In lines 11-21, the prologue is prepared as a class variable. This is done as a triple-quoted string, so that the string can span multiple lines. The string is then written with self.output in the write_prologue function (lines 24-29). The prepared string contains a "%s" (line 14) that is replaced in line 29 with the encoding of the HTML file (this assumes that the pybtex.io module has been imported). The string also contains a "{HEAD}" and a "{BODY}". These are placeholders for extra head and body content that can be added. For them, a simple prepout function is given in lines 22-23. If this function is never called, the placeholders must be replaced with empty strings. To ensure this, the function is called in write_prologue with empty strings (line 26). If prepout was already called before, "{HEAD}" and "{BODY}" no longer exist in the string, and this second call can fail if the inserted content itself contains curly braces. So a try (lines 25-26) and an except block (lines 27-28) are used to catch a ValueError in that case and do nothing.

The main function only needs small changes:
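The heading logic of write_entry can be sketched without a backend. The function name and the (year, html) pairs below are made up for illustration; the entries are expected to be sorted by year already.

```python
def render_with_year_headings(entries):
    """Render (year, html) pairs, writing a heading whenever the year changes."""
    output = []
    last_year = None  # no heading written so far
    for year, html in entries:
        if year != last_year:  # first entry of a new year
            output.append('<h3 class="{0} year">{0}</h3>'.format(year))
            last_year = year
        output.append(html)
    return "\n".join(output)


page = render_with_year_headings([
    (2015, "<div>A</div>"),
    (2015, "<div>B</div>"),
    (2013, "<div>C</div>"),
])
print(page)
```

Because only the previous label is remembered, unsorted input would repeat a year heading each time that year reappears.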
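The two-step filling of the prologue can be tried with plain strings. This is a sketch with a shortened template and made-up head and body content; it also shows when a repeated format call really fails in plain Python.

```python
TEMPLATE = """<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=%s">
{HEAD}
</head>
<body>
{BODY}
"""

# Step 1: str.format fills the named placeholders and leaves %s alone.
filled = TEMPLATE.format(HEAD="<title>Publications</title>",
                         BODY="<h1>Publications</h1>")

# Step 2: the % operator fills in the encoding.
page = filled % "utf-8"
print(page)

# A second format call is harmless here: unused keyword arguments are ignored.
unchanged = filled.format(HEAD="", BODY="") == filled
print(unchanged)  # True

# It does fail once the inserted content contains curly braces, e.g. CSS.
with_css = TEMPLATE.format(HEAD="<style>h3 { color: red; }</style>", BODY="")
try:
    with_css.format(HEAD="", BODY="")
    second_call_failed = False
except (KeyError, IndexError, ValueError) as error:
    second_call_failed = True
    print("second format call failed:", type(error).__name__)
```

Depending on the inserted content, the failure is not always a ValueError, so it can be safer to remember in a flag whether the placeholders were already filled.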

if __name__ == "__main__":
	db = DB()
	orcs = [OrcID(*t) for t in db.getList()]
	db.close()
	alldocs = []
	api = API()
	for orc in orcs:
		alldocs += [d for d in api.getWorks(orc) if orc.start <= d.date <= orc.stop]
	alldocs.sort()
	uniqdocs = [doc for doc,_ in itertools.groupby(alldocs)]
	bib = BibliographyData()
	for d in uniqdocs:
		joinBibliography (bib,parse_string(api.getWork(d),"bibtex"))
	style = HtmlStyle()
	style.sort = lambda x: sorted(x, key = lambda e:-int(e.fields['year']))
	style.format_labels =  lambda x: [int(e.fields['year']) for e in x]
	formatbib = style.format_bibliography(bib)
	back = HtmlBackend()
	back.write_to_file(formatbib,"out.html")

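In line 16, the format_labels function of the style is overwritten so that the label of every entry is its year. In line 18, the HtmlBackend is now used instead of Backend. The rest is the same.

The result can be downloaded here and looks as follows:

This looks exactly how we want it. So we have created our output.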

The complete result of this section can be downloaded here: