CENTER FOR
SCALABLE DATA ANALYTICS
AND ARTIFICIAL INTELLIGENCE

Static Publications Site-Tutorial (ORC-Schlange) - Rest-API

Rest-API

Now we have everything ready to interact with the public ORCID-API. To do this, we use a third-party library: Requests: HTTP for Humans.

The requests library says about itself:

"Requests is the only Non-GMO HTTP library for Python, safe for human consumption."

Using this library is simple and straightforward. However, first we need to install it. Thanks to pip, this is easy. Run:

pip install requests

If this does not work, have a look at the installation instructions of requests.

With requests installed, we can look at the public API of ORCID. As a reminder, we want to read public data from the sandbox instance of ORCID. We use version 2.0 of the API and request all answers in JSON format. How this is done is explained on the API website in a basic tutorial from ORCID.

This tutorial will cover all required endpoints and queries used for the publication list. However, if you want further information or want to change the queries, the links above are good starting points.

The interaction with the API needs to be authorized. This authorization uses a client_id and a client_secret that you can create for an app from your account. Here we can simply use the credentials already created for Norbert E. Horn. Since they belong to the sandbox, it is not so important to keep them secret. With these credentials, we can request a read-public access token. This is the first API interaction, and it uses a specific endpoint. The answer contains the token. This token is then used for all other interactions with the API and sent with every request.

Besides this initial interaction to get the token, we have two more interactions with the API: getting all WorkSummarys of a specific OrcID and getting the complete work for a specific WorkSummary. However, we start with the initialization of the class, where we get the access token:

from requests import Session
class API:
	auth = "https://sandbox.orcid.org/oauth/token"
	ORC_client_id = "APP-DZ4II2NELOUB89VC"
	ORC_client_secret = "c0a5796e-4ed3-494b-987e-827755174718"
	def __init__(self):
		self.s = Session()
		self.s.headers = {'Accept': 'application/json'}
		data = {"grant_type":"client_credentials", "scope":"/read-public","client_id":self.ORC_client_id, "client_secret":self.ORC_client_secret}
		r = self.s.request(method ="post",url= self.auth, data=data)
		self.s.headers = {'Accept': 'application/json', "Authorization": "Bearer " + r.json()["access_token"]}

In the first line, the class Session is imported from the requests library. This is the session that interacts with the API.

In lines 3-5, the class variables used to get the access token are saved: first the URL that provides the token for the ORCID sandbox, then the credentials of Norbert E. Horn.

In line 6, the initialization starts. In line 7, a new Session object is created and saved in self. In line 8, the headers property of this Session is set. It is a dict with data that is sent as headers in all requests made with this session. At this point, we store that we want the answers in JSON format. In line 9, a dict is saved that contains the data that is sent with the request. Four properties are saved:

  • The "grant_type" which is set to "client_credentials" because we use client credentials to get the token.
  • The "scope" which is set to "/read-public" because we want a token to read public data.
  • The "client_id" to authenticate the access.
  • The "client_secret" to verify the authentication.

In line 10 of the class listing, finally, the request is sent. This is done by the Session object with its request function. This function is given three arguments:

  • The method, which is the HTTP method; here we make a "post".
  • The url, which is here the auth address.
  • The data, which is the prepared data dict.

The result is a Response object. Its json() function parses the response body and creates a dict from it. In this dict, the key "access_token" stores the token (as a string). In line 11, the token is read from this dict and saved in a new header for the Session.
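The token step can be tried out without touching the network. The following is a minimal sketch, assuming the token is passed on as an OAuth 2.0 Bearer token in the Authorization header, which is how the ORCID documentation describes it. The helper names token_payload and auth_headers and the placeholder credentials are hypothetical and not part of the tutorial's code.

```python
from urllib.parse import urlencode


def token_payload(client_id, client_secret):
    """Build the form fields for a client-credentials token request."""
    return {
        "grant_type": "client_credentials",
        "scope": "/read-public",
        "client_id": client_id,
        "client_secret": client_secret,
    }


def auth_headers(token_response):
    """Build session headers from the parsed JSON of the token response."""
    return {
        "Accept": "application/json",
        # Assumption: the token is sent as an OAuth 2.0 Bearer token.
        "Authorization": "Bearer " + token_response["access_token"],
    }


# Placeholder values, not real credentials.
payload = token_payload("APP-EXAMPLE", "example-secret")
# requests form-encodes a dict passed as data=; urlencode shows the same idea.
body = urlencode(payload)
headers = auth_headers({"access_token": "example-token"})
print(body)
print(headers["Authorization"])
```

Keeping the two steps as small pure functions makes it easy to check what would be sent before a real request is made.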

The next class function is getWorks, which gets all WorkSummarys of a given ID. To do this, we need the endpoint of the ORCID-API that returns this information. An overview of all endpoints is given as a table in the basic tutorial. The endpoint that we are looking for is "/works". It gives a summary of research works. To complete the URL, we also need the resource URL for the v2.0 public API of the sandbox: "https://pub.sandbox.orcid.org/v2.0". The complete URL is then:

https://pub.sandbox.orcid.org/v2.0/[ORCID iD]/works
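As a small illustration, the URL pattern above can be filled in with string formatting. This is only a sketch; the function name works_url is hypothetical, and the iD is the sandbox iD that appears in the example response of this tutorial.

```python
BASE_URL = "https://pub.sandbox.orcid.org/v2.0"


def works_url(orcid_id, base_url=BASE_URL):
    """Return the URL of the works summary endpoint for one ORCID iD."""
    return "{0}/{1}/works".format(base_url, orcid_id)


url = works_url("0000-0002-1909-4153")
print(url)
```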

An overview of what the response from the API looks like is given on GitHub. However, this is not trivial to understand. To get the idea faster, it is better to look at an example response.

In the following, a shortened example result is shown:

{
	'last-modified-date': {'value': 1497863814424}, 
	'group': [
		{
			'last-modified-date': { 'value': 1497791040610}, 
			'external-ids': {
				'external-id': []
			}, 
			'work-summary': [
				{
					'put-code': 837564, 
					'created-date': {'value': 1497791040610}, 
					'last-modified-date': {'value': 1497791040610}, 
					'source': {
						'source-orcid': {
							'uri': 'http://sandbox.orcid.org/0000-0002-1909-4153', 
							'path': '0000-0002-1909-4153', 
							'host': 'sandbox.orcid.org'
						}, 
						'source-client-id': None, 
						'source-name': {'value': 'Norbert E. Horn'}
					}, 
					'title': {
						'title': {'value': 'Finding the data unicorn: A hierarchy of hybridity in data and computational journalism'}, 
						'subtitle': None, 
						'translated-title': None
					}, 
					'external-ids': {
						'external-id': []
					}, 
					'type': 'JOURNAL_ARTICLE', 
					'publication-date': {
						'year': {'value': '2017'}, 
						'month': None, 
						'day': None, 
						'media-type': None
					}, 
					'visibility': 'PUBLIC', 
					'path': '/0000-0002-1909-4153/work/837564', 
					'display-index': '1'
				}
			]
		}, 
		...
	],
	'path': '/0000-0002-1909-4153/works'
}

The response is a JSON object with three members: last-modified-date, group, and path. At the moment, only the group member is interesting because it is a list of the work summaries. In this list, a JSON object with again three members is stored for every work of the researcher. For us, only the work-summary member is interesting. It is a list with exactly one element, which is the work-summary object.

This work-summary object has many members. Three of them are interesting for us: the title, the publication-date, and the path. The last one contains the API path to the complete record of this work. The title is again an object, in which the title as a string can be found in title.value. The publication-date is an object that has separate members for year, month, and day. Each of them can be None or store the respective date value in value.

Now, we need a Python function that parses this information:
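The navigation described above can be tried on a plain dict before any class is involved. The following sketch uses a heavily trimmed stand-in for the parsed response (the title is shortened and most members are left out) and collects title, year, and path into tuples.

```python
# A trimmed stand-in for the parsed JSON of the /works response.
response = {
    "group": [
        {
            "work-summary": [
                {
                    "title": {"title": {"value": "Finding the data unicorn"}},
                    "publication-date": {
                        "year": {"value": "2017"},
                        "month": None,
                        "day": None,
                    },
                    "path": "/0000-0002-1909-4153/work/837564",
                }
            ]
        }
    ]
}

rows = []
for group in response["group"]:
    summary = group["work-summary"][0]  # the single summary of this group
    title = summary["title"]["title"]["value"]
    year = summary["publication-date"]["year"]["value"]
    rows.append((title, year, summary["path"]))

print(rows)
```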

	baseurl = "https://pub.sandbox.orcid.org/v2.0"
	getDate = lambda self,d: Date(d["year"]["value"],d["month"]["value"] if d["month"] else None, d["day"]["value"] if d["day"] else None )
	def getWorks(self,id):
		r = self.s.request(method= "get",url = "{0}/{1}/works".format( self.baseurl, id.getID()))
		for work in (w["work-summary"][0] for w in r.json()["group"]):
			yield WorkSummary(work["path"],work["title"]["title"]["value"],self.getDate(work["publication-date"]))

In line 1, the baseurl is saved as a class variable so that it can be used later. In line 2, a helper function is defined that transforms a publication-date object from the API into a Date object as described above. The function is a lambda function and gets self and the date in dict format as d as input. The transformation is not completely trivial because the day should be None if ["day"] is None, and otherwise it should be the value of ["day"]["value"]. Accessing ["day"]["value"] raises an error if ["day"] is None. To solve this, a conditional expression is used to check whether ["day"] is not None. If so, the value of ["day"]["value"] is used. The same must be done for the month.
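The conditional expression can also be written as a small named helper, which some readers may find easier to follow than the lambda. This is a sketch only: the namedtuple Date is a hypothetical stand-in for the tutorial's Date class, and value_or_none and get_date are illustrative names.

```python
from collections import namedtuple

# Hypothetical stand-in for the tutorial's Date class.
Date = namedtuple("Date", ["year", "month", "day"])


def value_or_none(member):
    """Return member["value"], or None when the member itself is None."""
    return member["value"] if member else None


def get_date(d):
    """Convert a publication-date dict from the API into a Date."""
    return Date(d["year"]["value"], value_or_none(d["month"]), value_or_none(d["day"]))


full = get_date({"year": {"value": "2017"}, "month": {"value": "06"}, "day": {"value": "18"}})
partial = get_date({"year": {"value": "2017"}, "month": None, "day": None})
print(full)
print(partial)
```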

The getWorks function gets an id as argument, which should be an OrcID object. For this id, all summaries are loaded and all WorkSummary objects are returned. However, a look at the function shows that it is not a normal function with a return keyword; instead, the yield keyword is used. This makes the function a generator function that returns an iterator. Such an iterator can be used in a for loop, where next() is called on it in every iteration. The generator body is executed like any function until a yield is reached. The value after yield is the result of the first next(). The state of the function is saved. Every time next() is called on the iterator, the function continues at the point of the last yield until the next yield is reached. This means that the values are produced one at a time, so a value that has been processed does not need to stay in memory. When the function ends without reaching a further yield, the iteration stops.

In line 4 of the getWorks listing, the request is sent. In this case, the "get" method is used, as the API expects. The url is created with string formatting. In line 5, the iteration over the works starts. This is done with a special case of the list comprehension. The works are given as a list in the "group" member, so we want to iterate over all elements of this list. The elements are saved in w. For each of them, the work summary is the first element of the member "work-summary", so we only want to iterate over this. A list comprehension:
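The pause-and-resume behaviour of a generator is easiest to see on a toy example that is independent of the API. In the following sketch, count_up is a hypothetical function used only for illustration.

```python
def count_up(limit):
    """Yield 1, 2, ..., limit, one value per next() call."""
    n = 1
    while n <= limit:
        yield n  # execution pauses here until the next value is requested
        n += 1


it = count_up(3)
first = next(it)   # runs the body up to the first yield
second = next(it)  # resumes after the yield and runs to the next one
rest = list(it)    # consumes whatever is left
print(first, second, rest)
```

A for loop does the same thing implicitly: it calls next() until the generator is exhausted.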

[w["work-summary"][0] for w in r.json()["group"]]

creates a list with exactly the objects that we iterate through. However, in the getWorks function the outer "[]" are replaced with "()". This means that we do not create a list but an iterator, using a generator expression. Thus, every element is created when it is needed and can be discarded afterwards. This is more memory-friendly than creating the complete list. In line 6, a WorkSummary object is created and yielded for every work. The path and title are obtained by simply following the keys to the strings in the JSON. For the date, the getDate function is used.

Using this function, we can create a list of all works and filter them. This is described in the next chapter. However, after the filtering, the complete data of each work should be parsed. So, we need a second function in the API that gets the complete work for a WorkSummary. The endpoint for this is "/work/[id]". The endpoint paths, including the id, are already saved in the WorkSummarys as path.

The following shows an example response of this endpoint:
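The difference between the two bracket styles can be checked directly. The following sketch uses a tiny made-up groups list in place of the real API response.

```python
groups = [{"work-summary": ["a"]}, {"work-summary": ["b"]}]

as_list = [g["work-summary"][0] for g in groups]  # built completely, up front
as_gen = (g["work-summary"][0] for g in groups)   # nothing computed yet

print(type(as_list).__name__, type(as_gen).__name__)
collected = list(as_gen)  # iterating produces the same elements
again = list(as_gen)      # a generator can only be consumed once
print(as_list, collected, again)
```

The last line shows the price of the lazy variant: once the generator has been iterated, it is empty.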

{
	'created-date': {'value': 1497791040610}, 
	'last-modified-date': {'value': 1497791040610}, 
	'source': {
		'source-orcid': {
			'uri': 'http://sandbox.orcid.org/0000-0002-1909-4153', 
			'path': '0000-0002-1909-4153', 
			'host': 'sandbox.orcid.org'
		}, 
		'source-client-id': None, 
		'source-name': {'value': 'Norbert E. Horn'}
	}, 
	'put-code': 837564, 
	'path': '/0000-0002-1909-4153/work/837564', 
	'title': {
		'title': {'value': 'Finding the data unicorn: A hierarchy of hybridity in data and computational journalism'}, 
		'subtitle': None, 
		'translated-title': None
	}, 
	'journal-title': {'value': 'Digital Journalism'}, 
	'short-description': None, 
	'citation': {
		'citation-type': 'BIBTEX', 
		'citation-value': '@article{hermida2017finding, title= {Finding the data unicorn: A hierarchy of hybridity in data and computational journalism}, author= {Hermida, Alfred and Young, Mary Lynn}, journal= {Digital Journalism}, volume= {5}, number= {2}, pages= {159--176}, year= {2017}, publisher= {Routledge}}\n\n'
	}, 
	'type': 'JOURNAL_ARTICLE', 
	'publication-date': {
		'year': {'value': '2017'}, 
		'month': None, 
		'day': None, 
		'media-type': None
	}, 
	'external-ids': {'external-id': None}, 
	'url': None, 
	'contributors': {'contributor': []}, 
	'language-code': None, 
	'country': None, 
	'visibility': 'PUBLIC'
}

The response is one object with many members. Some of them are already known from the summary; others are new but not interesting for us. In fact, the only way to get the complete record is to read the citation. The other members do not contain all the information. So here, we are only interested in the citation and simply want a function that returns the citation-value.

The function looks as follows:

	def getWork(self, summary):
		r = self.s.request(method= "get",url= self.baseurl + summary.path)
		return r.json()['citation']['citation-value']

The function is straightforward. It gets a summary, which is a WorkSummary object, as input. First, the request is made (line 2). It is again a "get", and the url is the combination of the baseurl and the path of the summary. Then, the citation-value is read from the response and returned.
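The lookup itself can be tested on plain dicts. The following sketch additionally assumes that the citation member may be None for some works, in the same way that subtitle or url are None in the example response; whether this actually happens in the sandbox data is not confirmed here. The name citation_value is hypothetical.

```python
def citation_value(work):
    """Return the citation string of a parsed work record, or None."""
    citation = work.get("citation")
    # Assumption: the member may be None, like 'subtitle' or 'url' in the sample.
    return citation["citation-value"] if citation else None


with_citation = {
    "citation": {
        "citation-type": "BIBTEX",
        "citation-value": "@article{key, year={2017}}",
    }
}
without_citation = {"citation": None}

print(citation_value(with_citation))
print(citation_value(without_citation))
```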

The complete class:

from requests import Session
class API:
	auth = "https://sandbox.orcid.org/oauth/token"
	ORC_client_id = "APP-DZ4II2NELOUB89VC"
	ORC_client_secret = "c0a5796e-4ed3-494b-987e-827755174718"
	def __init__(self):
		self.s = Session()
		self.s.headers = {'Accept': 'application/json'}
		data = {"grant_type":"client_credentials", "scope":"/read-public","client_id":self.ORC_client_id, "client_secret":self.ORC_client_secret}
		r = self.s.request(method ="post",url= self.auth, data=data)
		self.s.headers = {'Accept': 'application/json', "Authorization": "Bearer " + r.json()["access_token"]}
	baseurl = "https://pub.sandbox.orcid.org/v2.0"
	getDate = lambda self,d: Date(d["year"]["value"],d["month"]["value"] if d["month"] else None, d["day"]["value"] if d["day"] else None )
	def getWorks(self,id):
		r = self.s.request(method= "get",url = "{0}/{1}/works".format( self.baseurl, id.getID()))
		for work in (w["work-summary"][0] for w in r.json()["group"]):
			yield WorkSummary(work["path"],work["title"]["title"]["value"],self.getDate(work["publication-date"]))
	def getWork(self, summary):
		r = self.s.request(method= "get",url= self.baseurl + summary.path)
		return r.json()['citation']['citation-value']

With this, we can write a new main block that gets all the WorkSummarys and prints them:

if __name__ == "__main__":
	db = DB()
	orcs = [OrcID(*t) for t in db.getList()]
	db.close()
	alldocs = []
	api = API()
	for orc in orcs:
		alldocs += api.getWorks(orc)
	for d in alldocs:
		print (d)

In line 8, getWorks is called, and the elements of the resulting iterator are appended to the alldocs list with "+=".

The output should look like this:

Finding the data unicorn: A hierarchy of hybridity in data and computational journalism: 2017-None-None
The generalized unicorn problem in Finsler geometry: 2015-None-None
Unicorn: A system for searching the social graph: 2013-None-None
The unicorn, the normal curve, and other improbable creatures.: 1989-None-None
Combined Measurement of the Higgs Boson Mass in p p Collisions at s= 7 and 8 TeV with the ATLAS and CMS Experiments: 2015-None-None
It's a small world: 1998-None-None
Combined Measurement of the Higgs Boson Mass in p p Collisions at s= 7 and 8 TeV with the ATLAS and CMS Experiments: 2015-None-None
11 The Death of the Author: 1994-None-None
Kritik der reinen Vernunft: 1889-None-None
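The "+=" in the main block works even though getWorks returns an iterator rather than a list, because a list can be extended in place with any iterable. A minimal sketch of this, where works_for is a hypothetical stand-in for api.getWorks that yields made-up titles:

```python
def works_for(orc):
    """Hypothetical stand-in for api.getWorks: yields two fake titles."""
    yield orc + "-work-1"
    yield orc + "-work-2"


alldocs = []
for orc in ["A", "B"]:
    alldocs += works_for(orc)  # same effect as alldocs.extend(works_for(orc))

print(alldocs)
```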

The complete result of this section can be downloaded here: