
Static Publications Site-Tutorial (ORC-Schlange) - Date, OrcID and WorkSummary Class (Filter 1)


Date, OrcID and WorkSummary Class (Filter 1)

In the following, we define our data model. It consists of three classes: Date, OrcID, and WorkSummary. Data retrieved from ORCID via the ORCID API will be parsed into objects of these types. The classes also implement comparison methods, which will be used to filter the data.

The first class is the Date class. It represents a date: either a publication date or the start or stop date of a user's membership at the institute. The special feature of this Date class is that only the year is required for a valid date. Undefined parts are None and are treated as matching any value. A comparison function implements this behavior.

class Date:
	def __init__(self, y, m, d):
		self.y = int(y)
		self.m = int(m) if m else None
		self.d = int(d) if d else None
	def check(self, other, attr):
		if getattr(self,attr) is None or getattr(other,attr) is None:
			return 1
		if getattr(self,attr) < getattr(other,attr):
			return 1
		if getattr(self,attr) > getattr(other,attr):
			return -1
		return 0
	__le__ = lambda self, other: True if 1 == (self.check(other,"y") or self.check(other,"m") or self.check(other,"d") or 1) else False
	__str__ = lambda self: str(self.y) + "-" + str(self.m) + "-" + str(self.d)

Lines 2-5 contain the initialization of the Date class. The year, month, and day can each be a number or a string that represents the number; month and day may also be None. The built-in function int() converts the string into a number. Lines 4 and 5 use a conditional expression to check whether the month (or day) is None. Python interprets several different values as False. The definition is:

In the context of Boolean operations, and also when expressions are used by control flow statements, the following values are interpreted as false: False, None, numeric zero of all types, and empty strings and containers (including strings, tuples, lists, dictionaries, sets and frozensets). All other values are interpreted as true.

Here, None is interpreted as False, so the part after the else is evaluated and None is stored in the attribute. This is necessary because int(None) raises a TypeError.
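The guard can be tried in isolation. The following is a minimal sketch; the helper name to_int_or_none is hypothetical and not part of the tutorial code, it only wraps the same conditional expression that Date.__init__ uses.

```python
def to_int_or_none(value):
    # Same pattern as in Date.__init__: a falsy input (None, "") yields None,
    # everything else is converted with int().
    return int(value) if value else None


print(to_int_or_none("12"))  # 12
print(to_int_or_none(None))  # None
print(to_int_or_none(""))    # None

# Without the guard, int(None) raises a TypeError.
try:
    int(None)
except TypeError as err:
    print("TypeError:", err)
```

Note that the numeric value 0 is falsy as well and would therefore also become None. For months and days this does no harm, since valid values start at 1.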

Lines 6-13 define a helper function that checks, for another date and a given attribute ("y", "m", or "d"), whether self is smaller than other at this attribute. To access the attribute, another built-in function is used: getattr(). Line 7 checks whether the attribute is None in one of the dates. If so, 1 is returned to indicate that self is smaller than or equal to other. If self is smaller than other (line 9), 1 is returned as well. If self is larger than other for this attribute (line 11), -1 is returned. If none of these cases applies, the attributes must be equal, and 0 is returned.
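The four possible outcomes of the helper can be sketched as a standalone function. This is an illustration only: the free function check and the SimpleNamespace objects are stand-ins for the method and the Date instances from the tutorial.

```python
from types import SimpleNamespace


def check(a, b, attr):
    # getattr(obj, "y") is the same as obj.y, but the name is given as a string.
    x, y = getattr(a, attr), getattr(b, attr)
    if x is None or y is None:
        return 1   # an undefined part matches anything
    if x < y:
        return 1   # a is smaller at this attribute
    if x > y:
        return -1  # a is larger at this attribute
    return 0       # both are equal at this attribute


# Stand-ins for two Date objects (hypothetical values).
a = SimpleNamespace(y=2015, m=None)
b = SimpleNamespace(y=2016, m=3)

print(check(a, b, "y"))  # 1
print(check(b, a, "y"))  # -1
print(check(a, a, "y"))  # 0
print(check(a, b, "m"))  # 1
```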

In line 14, the less-than-or-equal comparison is defined. In a Python class, this is done by overriding the __le__ method. It is written as a lambda function that takes self and other as arguments. It again uses a conditional expression that returns True if a chain of checks evaluates to 1 and False otherwise. This chain is linked with "or" and works because of the special definition of or:

"The expression x or y first evaluates x; if x is true, its value is returned; otherwise, y is evaluated and the resulting value is returned."

First, the years (y) are compared. If they are equal, 0 is returned. This is interpreted as False, and the next check is evaluated. Otherwise, if self.y is smaller than other.y, the check returns 1 and the evaluation stops. The result is then 1, and True is returned. If self.y is larger than other.y, the check returns -1, which also stops the evaluation but is not equal to 1, so False is returned. In this way, all three parts from year to day are checked in turn. If all are equal, every check is interpreted as False and the fourth element of the chain, a plain 1, is evaluated, so the function returns True in this case as well.
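The behavior of the chain can be reproduced with plain numbers. The sketch below assumes the three check results are already known; the function name le_from_checks is hypothetical and only mirrors the expression used in __le__.

```python
# "or" returns the first truthy operand itself, not a bool.
print(0 or 1)       # 1
print(-1 or 1)      # -1, because every non-zero number is truthy
print(0 or 0 or 1)  # 1


def le_from_checks(y, m, d):
    # y, m and d are the results of check() for year, month and day:
    # 1 (smaller or undefined), 0 (equal) or -1 (larger).
    return 1 == (y or m or d or 1)


print(le_from_checks(1, -1, -1))  # True: the year is already smaller
print(le_from_checks(0, -1, 1))   # False: same year, but the month is larger
print(le_from_checks(0, 0, 0))    # True: all parts are equal
```

Because the first non-zero result ends the evaluation, later parts are only consulted when all earlier parts are equal.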

In the last line, the standard string representation of the Date class is overridden. This is again a special method (__str__). Again, a lambda function is used that simply concatenates the string forms of the three parts with a "-" between them.

The second class is the OrcID class. This class represents an ORCID as stored in the SQLite DB. It therefore takes an id, a start date, and a stop date for initialization. We also define two helper functions: first, a getID function that formats the ORCID with a "-" after every 4 symbols; second, a function that returns a readable string representation of the OrcID object.

class OrcID:
	def __init__(self, id, start, stop):
		self.id = id
		self.start = Date(*start.split("-"))
		self.stop = Date(*stop.split("-"))
	getID = lambda self: "-".join([self.id[4 * i : 4 * (i + 1)] for i in range(4)])
	__str__ = lambda self: self.getID() + ": " + str(self.start) + " - " + str(self.stop)

The initialization in lines 2-5 saves the id, converts the start and stop dates into Date objects, and stores them. The conversions in lines 4 and 5 expect the input string to be in the format "YYYY-MM-DD". This is the format in which SQLite returns the dates. First, the string method split is used to turn the string into a list [y,m,d]. The list is then unpacked into single arguments with the "*" operator.

In line 6, the getID function is defined as a lambda function. The goal of placing a "-" after every 4 symbols is reached by first creating a list of these 4-symbol blocks. For this, i iterates over range(4), i.e. [0,1,2,3], and each block is the slice from 4*i to 4*(i+1). Afterwards, these blocks are joined into one string with "-" as separator using the str.join function. The last line is the simple string representation of the OrcID object in the form 'id: start - stop'.
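Both techniques, slicing into blocks and unpacking a split string, can be tried outside the class. This is a minimal sketch; it assumes the id is stored as 16 characters without hyphens (as getID implies), and the helper describe is hypothetical.

```python
# Assumption: the raw id has 16 characters and no hyphens.
raw_id = "0000000219094153"

# Slice the string into four blocks of four characters each.
blocks = [raw_id[4 * i : 4 * (i + 1)] for i in range(4)]
formatted = "-".join(blocks)
print(blocks)     # ['0000', '0002', '1909', '4153']
print(formatted)  # 0000-0002-1909-4153

# split() produces a list, and "*" unpacks it into positional arguments.
parts = "2016-12-31".split("-")


def describe(y, m, d):
    # Hypothetical helper that takes three separate arguments.
    return "year=" + y + ", month=" + m + ", day=" + d


print(describe(*parts))  # year=2016, month=12, day=31
```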

To test these two classes, we change the main function as follows:

if __name__ == "__main__":
	db = DB()
	orcs = [OrcID(*t) for t in db.getList()]
	db.close()
	for orc in orcs:
		print("Do something with",orc)

In line 3, a list comprehension is used to create a list of OrcIDs from the list of tuples that db.getList() returns. Again, unpacking is used to pass the elements of each tuple t as the parameters of OrcID.
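The same pattern works without a database. In the sketch below, the list rows is a hypothetical stand-in for the result of db.getList(), with assumed date strings, and Member is a simplified stand-in for OrcID that only stores its arguments.

```python
class Member:
    # Simplified stand-in for OrcID: stores the three values unchanged.
    def __init__(self, id, start, stop):
        self.id = id
        self.start = start
        self.stop = stop


# Hypothetical rows in the shape (id, start, stop).
rows = [
    ("0000000219094153", "1900-01-01", "2016-12-31"),
    ("000000020183570X", "1900-01-01", "2016-12-31"),
    ("0000000303977442", "1900-01-01", "2016-12-31"),
]

# Each tuple t is unpacked into the three parameters of Member.
members = [Member(*t) for t in rows]

for member in members:
    print(member.id, member.start, member.stop)
```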

The output should look like this:

Do something with 0000-0002-1909-4153: 1900-1-1 - 2016-12-31
Do something with 0000-0002-0183-570X: 1900-1-1 - 2016-12-31
Do something with 0000-0003-0397-7442: 1900-1-1 - 2016-12-31

 

The last class is the WorkSummary class. This class represents a summary of a work, i.e. a publication. A WorkSummary has three fields: a path where more information can be found, a title, and a publication date. Objects of this class will be compared later in the filter step. Thus, a less-than and an equality method are implemented.

class WorkSummary:
	def __init__(self, path, title, date):
		self.path = path
		self.title = title
		self.date = date
	__lt__ = lambda self, other: self.date.y < other.date.y or (self.date.y == other.date.y and self.title < other.title)
	__eq__ = lambda self, other: self.date.y == other.date.y and self.title == other.title
	__str__ = lambda self: self.title + ": " + str(self.date)

The initialization in lines 2-5 is straightforward.

The two comparisons in lines 6 and 7 only compare the year of the publication and the title. The rest of the Date is ignored so that the comparison is not overly specific. Both comparisons are written in lambda form. In line 6, the less-than comparison uses the special method __lt__ and checks whether the year is smaller, or whether the year is equal and the title is smaller. For the equality comparison, the special method __eq__ is used. It is straightforward: two WorkSummaries are equal only when both the year and the title are equal. The last line is a simple string representation of the WorkSummary object in the form "title: date".
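The effect of the two comparisons can be seen when sorting and comparing a few objects. This is a sketch under assumptions: the paths and titles are made up, the methods are written with def instead of lambda, and SimpleNamespace objects with only a y attribute stand in for Date.

```python
from types import SimpleNamespace


class WorkSummary:
    def __init__(self, path, title, date):
        self.path = path
        self.title = title
        self.date = date

    def __lt__(self, other):
        # Order by year first, then by title.
        return self.date.y < other.date.y or (
            self.date.y == other.date.y and self.title < other.title
        )

    def __eq__(self, other):
        # Path, month and day are ignored on purpose.
        return self.date.y == other.date.y and self.title == other.title


def year(y):
    # Stand-in for a Date object: only the year attribute is needed here.
    return SimpleNamespace(y=y)


works = [
    WorkSummary("/p/3", "B title", year(2016)),
    WorkSummary("/p/1", "A title", year(2016)),
    WorkSummary("/p/2", "Z title", year(2015)),
]

# sorted() only relies on __lt__.
titles = [w.title for w in sorted(works)]
print(titles)  # ['Z title', 'A title', 'B title']

# Same year and title, different path: still considered equal.
same = WorkSummary("/other/path", "A title", year(2016)) == works[1]
print(same)  # True
```

One consequence to keep in mind: a class that defines __eq__ without __hash__ is unhashable in Python 3, so such objects cannot be placed in a set or used as dictionary keys without adding a __hash__ method.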

The complete result of this section can be downloaded here: