import glob
import pandas as pd
from lxml import etree
The 2014 data are here!¶
Here I will basically do the same as with the 2013 data with the data from 2014. As the data structure has been slightly amplified you'll find some changes in this notebook, too.
Why all this?¶
Have you ever been to a German hospital? Did you get the chance to inform yourself before your stay? There is a ton of open care quality data around hospitals, especially in the hospital quality reports. They are widely regarded as useless for patients though. Even most scientists sighed when I told them that I am going to deep dive into these reports. Still, there was a chance I'd come up with some insights from this data mess. So what you'll find here is both, a lot of dead end roads and some interesting insights in hospital care quality.
Database and preparation¶
The database are the 2014 hospital quality reports ("Qualitätsberichte der Krankenhäuser"). These are the most recent available for the nearly 2000 hospitals in Germany. For each facility exist three files, separated into two databases. The first database ("QB") contains structural data including e.g. contact details, number of beds and number of cases for each indication (ICD-10) and procedure (OPS-2014). The second database ("EQS") contains a set of quality indicators defined by the AQUA institute. The file path contains -00- (for hospitals with only one location), -01-, -02- etc. for each location (if there is more than one) and a -99- file (which is a summary of all locations of one hospital).
For more general evaluations I had to exclude the -99- and -00- files.
Thanks to Stefan who did most of the coding.
QB_QUERY = {
'name': 'Krankenhaus/Kontaktdaten/Name',
'ik': 'Krankenhaus/Kontaktdaten/IK',
'plz': 'Krankenhaus/Kontaktdaten/Kontakt_Adresse/Postleitzahl',
'traeger': 'Krankenhaustraeger/Name',
'traeger_art': 'Krankenhaustraeger/Krankenhaustraeger_Art/Art'
}
EQS_QUERY = {
'ik': 'IK_Krankenhaus',
'land': 'Land'
}
def query_doc(root, xpath):
matches = root.xpath('.//' + xpath)
for match in matches:
val = match.text.strip()
try:
val_new = float(val.replace(',', '.'))
val = val_new
except ValueError:
pass
yield val
def query_doc_multi(root, xpaths):
matches = {}
for key, xpath in xpaths.items():
try:
matches[key] = list(query_doc(root, xpath))[0]
except IndexError:
pass
return matches
CACHE = {}
INFO_CACHE = {}
def get_info(path, base_query):
root = get_root(path)
if path not in INFO_CACHE:
extra = query_doc_multi(root, base_query)
INFO_CACHE[path] = extra
return INFO_CACHE[path]
def get_root(path):
if path not in CACHE:
with open(path) as f:
CACHE[path] = etree.parse(f)
return CACHE[path]
def search_path(path, xpath):
root = get_root(path)
return query_doc(root, xpath)
def search_paths(paths, xpath, exclude=None, include=None, base_query=QB_QUERY):
for path in paths:
if include is not None and not any(x in path for x in include):
continue
if exclude is not None and any(x in path for x in exclude):
continue
extra = get_info(path, base_query)
extra['path'] = path
results = search_path(path, xpath)
for r in results:
yield r, extra
def search_xml(xpaths, data_path=None, exclude=None, include=None, base_query=QB_QUERY):
names = glob.glob(data_path + '/*.xml')
return {key: pd.DataFrame(
search_generator(names, xpath, key, exclude=exclude, include=include, base_query=base_query)
) for key, xpath in xpaths.items()}
def search_generator(names, xpath, key, exclude=None, include=None, base_query=None):
for result, extra in search_paths(names, xpath, exclude=exclude, include=include, base_query=base_query):
d = dict(extra)
d[key] = result
yield d
def search_qb(**xpaths):
return search_xml(xpaths, exclude=('-99-',), data_path='data/base_2014',)
def search_eqs(**xpaths):
return search_xml(xpaths, data_path='data/external_2014', exclude=('-99-',), base_query=EQS_QUERY)
def search_qb_12(**xpaths):
return search_xml(xpaths, data_path='data/base_12')
def search_eqs_12(**xpaths):
return search_xml(xpaths, data_path='data/external_12', base_query=EQS_QUERY)
def search_qb_99(**xpaths):
return search_xml(xpaths, data_path='data/base_2014', include=('-99-', '-00-'),)
def search_eqs_99(**xpaths):
return search_xml(xpaths, data_path='data/external_2014', base_query=EQS_QUERY, include=('-99-', '-00-'),)
def search_qb_0_1(**xpaths):
return search_xml(xpaths, data_path='data/base_2014', include=('-00-', '-01-'),)
def search_eqs_0_1(**xpaths):
return search_xml(xpaths, data_path='data/external_2014', base_query=EQS_QUERY, include=('-00-', '-01-'),)
def apply_func(func, include=None, exclude=None, data_path='data/base_2014'):
for path in glob.glob(data_path + '/*.xml'):
if include is not None and not any(x in path for x in include):
continue
if exclude is not None and any(x in path for x in exclude):
continue
root = get_root(path)
yield func(root, path)
Step 1: Gather basic information on hospitals¶
Hospitals have a unique ID (IK-Nummer) which is very useful to merge query results with additional data from another source. For example, applying QB_QUERY on all hospitals yields their ID, name, provider and type of provider. It's the EQS_QUERY however where we find quality indicators. We basically match the results from those two sources via the hospital ID.
def add_info(df):
return df.merge(name_df, left_on='ik', right_index=True, how='inner')
name_df = search_qb(hospital_name='Krankenhaus/Kontaktdaten/Name')['hospital_name']
name_df = name_df.groupby('ik')[['name', 'traeger', 'traeger_art']].first()
name_df.head()
land_df = search_eqs_0_1(land='Land')['land']
name_df = name_df.reset_index().merge(land_df[['ik', 'land']][land_df.land != 'DV'], on='ik', how='left')
name_df = name_df.set_index('ik')
name_df.head()
How many hospitals are there per region?¶
name_df.land.value_counts()
Step 2: Pick out hospitals with abnormalities¶
Here we search only files of hospital with either one location (-00-) or with more than one location (-01-). For every abnormality found in the data set the query yields the corresponding hospital ID and the affected quality indicator.
import itertools
def get_a_quality_indicator(root, path=''):
if not '-00-' in path and not '-01-' in path:
return
ik = int(root.xpath('.//IK_Krankenhaus/text()')[0])
qas = root.xpath('Ergebnis//Qualitaetsindikator[.//Ergebnis_Berichtsjahr[starts-with(text(), "A")]]')
for qa in qas:
yield {
'ik': ik,
'kuerzel': qa.xpath('.//Kuerzel_Qualitaetsindikator/text()')[0],
'ergebnis': qa.xpath('.//Ergebnis_Berichtsjahr/text()')[0],
}
qa_qi_df = pd.DataFrame(list(itertools.chain.from_iterable(apply_func(get_a_quality_indicator, data_path='data/external_2014'))))
qa_qi_df.head()
qa_qi_df = add_info(qa_qi_df)
qa_qi_df.head()
Step 3: Analyse the abnormalities¶
Four main questions: What's the nationwide situation? How many abnormalities have been reported in my region? And in which hospital? At last, which quality indicator is mostly suspicious in my region?
#Brauche ich das hier?
qa_qi_df.pivot_table(index='land', columns='kuerzel')
qa_qi_df[qa_qi_df == ]
qa_qi_df.kuerzel.value_counts()
qa_qi_df.ergebnis.value_counts()
How many abnormalities have been reported per region?¶
qa_qi_df.land.value_counts()
Which type of provider has reported most quality abnormalities?¶
qa_qi_df.traeger_art.value_counts()
How many hospitals have reported abnormalities? And which hospital has reported most?¶
Find the answer to the first question below the table. 553 hospitals have at least one abnormality.
add_info(qa_qi_df.groupby(["ik"]).ergebnis.count().reset_index().sort_values("ergebnis", ascending=False))
Which quality indicator has been reported as abnormal mostly?¶
qa_qi_df.kuerzel.value_counts()
qa_qi_df_NW = qa_qi_df[qa_qi_df.land == 'NW']
qa_qi_df_NW
How many hospitals in NW have reported quality abnormalities? And which hospital has reported most?¶
Find the answer to the first question below the table. 115 hospitals have one or more quality abnormality.
add_info(qa_qi_df_NW.groupby(["ik"]).ergebnis.count().reset_index().sort_values("ergebnis", ascending=False))
Which type of provider reported most quality abnormalities?¶
qa_qi_df_NW.traeger_art.value_counts()
Which quality indicator has been reported abnormally most?¶
qa_qi_df_NW.kuerzel.value_counts()
qa_qi_df_NW[qa_qi_df_NW.kuerzel == '50722']
Research a particular hospital¶
qa_df[qa_df.ik == 260530502]