Static Publications Site-Tutorial (ORC-Schlange) - Sorting and Filter (2)

Sorting and Filter (2)

In this step, we want to get rid of works that do not fall within the period in which a person worked for us. We also want to remove duplicate works.

For this, we use the comparison functions implemented in the classes and some standard Python libraries.

First, we want to get rid of the works that do not overlap with the dates of the OrcID objects, i.e. the works that do not belong to our group.

This is done by altering the line where the results of getWorks are added to alldocs:

alldocs += [d for d in api.getWorks(orc) if orc.start <= d.date <= orc.stop]

Here, we again use a list comprehension, but this time with an if clause. The new list contains only those elements for which the if condition evaluates to True. The condition checks whether the date of the work (d) lies between the start and stop dates of the OrcID. Note the chained form of the two checks with two <= operators. This is slightly more efficient than two separate comparisons joined with and, because the middle expression is evaluated only once, as the Python documentation states:

"Comparisons can be chained arbitrarily, e.g., x < y <= z is equivalent to x < y and y <= z, except that y is evaluated only once (but in both cases z is not evaluated at all when x < y is found to be false)."

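The difference between the chained and the separate form can be made visible with a small sketch. The function work_date and the variables start, stop and dates are made up for this illustration and are not part of the tutorial code; the function only counts how often it is evaluated.

```python
from datetime import date

calls = 0  # counts how often work_date() is evaluated


def work_date():
    """Hypothetical helper: return a fixed date and count the call."""
    global calls
    calls += 1
    return date(2015, 6, 1)


# Assumed example period, standing in for orc.start and orc.stop.
start, stop = date(2014, 1, 1), date(2016, 12, 31)

# Chained form: the middle expression is evaluated once.
calls = 0
chained = start <= work_date() <= stop
chained_calls = calls

# Separate form: the middle expression is evaluated twice.
calls = 0
separate = start <= work_date() and work_date() <= stop
separate_calls = calls

print(chained, chained_calls)    # True 1
print(separate, separate_calls)  # True 2

# The same chained check used as a filter in a list comprehension.
dates = [date(2013, 5, 1), date(2015, 6, 1), date(2017, 2, 1)]
in_period = [d for d in dates if start <= d <= stop]
print(in_period)  # [datetime.date(2015, 6, 1)]
```

For a plain attribute such as d.date the saving is small; the chained form is mainly the more readable one.
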
We now have a list (alldocs) containing all works done by the group. The next step is to sort them. The WorkSummary class already implements the less-than and equality comparisons, so we can simply call sort() on the list:

alldocs.sort()
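
Why a plain sort() call is enough can be sketched with a small stand-in class. The class Work below is hypothetical and only assumes that works are compared by year and title; the real WorkSummary class from the earlier steps may compare differently.

```python
class Work:
    """Hypothetical stand-in for WorkSummary, compared by (year, title)."""

    def __init__(self, title, year):
        self.title = title
        self.year = year

    def _key(self):
        # Assumption for this sketch: order by year, then by lower-cased title.
        return (self.year, self.title.lower())

    def __lt__(self, other):
        # list.sort() only needs the "<" comparison.
        return self._key() < other._key()

    def __eq__(self, other):
        # Equality is what later decides which entries count as duplicates.
        return self._key() == other._key()

    def __repr__(self):
        return f"{self.title}: {self.year}"


works = [
    Work("Unicorn", 2013),
    Work("It's a small world", 1998),
    Work("unicorn", 2013),
]
works.sort()
print(works)  # [It's a small world: 1998, Unicorn: 2013, unicorn: 2013]
```

After sorting, the two entries that compare as equal sit next to each other, which is what the duplicate removal in the next step relies on.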

The last part is removing the duplicate entries. This can be done with the groupby function from the standard library module itertools. This function needs a sorted list as input so that equal objects are adjacent and end up in the same group. For each group it yields a tuple consisting of a key and an iterator over the objects in that group. Here, the key is the first of these objects, so we can simply iterate over the result and keep only the keys:

import itertools
uniqdocs = [doc for doc,_ in itertools.groupby(alldocs)]

We create a new list that contains only the keys from the groupby call. The first element of each tuple (the key) is stored in doc; the second is bound to "_", the conventional name for a value that is not used.
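
The need for sorted input can be seen in a minimal sketch with plain strings instead of WorkSummary objects. The example data is made up for illustration.

```python
import itertools

unsorted_items = ["b", "a", "b", "a"]
sorted_items = sorted(unsorted_items)

# groupby only merges neighbouring equal elements, so without sorting
# the duplicates survive.
without_sort = [key for key, _ in itertools.groupby(unsorted_items)]
with_sort = [key for key, _ in itertools.groupby(sorted_items)]

print(without_sort)  # ['b', 'a', 'b', 'a']
print(with_sort)     # ['a', 'b']

# The second tuple element is an iterator over the group's members.
sizes = [(key, len(list(group))) for key, group in itertools.groupby(sorted_items)]
print(sizes)  # [('a', 2), ('b', 2)]
```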

With these changes the main function looks as follows:

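import itertools
if __name__ == "__main__":
	# Read the stored OrcID entries from the database.
	db = DB()
	orcs = [OrcID(*t) for t in db.getList()]
	db.close()
	alldocs = []
	api = API()
	for orc in orcs:
		# Keep only works dated within the person's period in the group.
		alldocs += [d for d in api.getWorks(orc) if orc.start <= d.date <= orc.stop]
	# Sort so that equal works are adjacent, then keep one per group.
	alldocs.sort()
	uniqdocs = [doc for doc,_ in itertools.groupby(alldocs)]
	for d in uniqdocs:
		print(d)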

The result is:

The unicorn, the normal curve, and other improbable creatures.: 1989-None-None
11 The Death of the Author: 1994-None-None
It's a small world: 1998-None-None
Unicorn: A system for searching the social graph: 2013-None-None
Combined Measurement of the Higgs Boson Mass in p p Collisions at s= 7 and 8 TeV with the ATLAS and CMS Experiments: 2015-None-None
The generalized unicorn problem in Finsler geometry: 2015-None-None

The list is now sorted and all duplicates are removed.

The complete result of this section can be downloaded here: