My server killed my client



Recently, we experienced a drastic lag in one of our services, due to which some of our asynchronous data export jobs slowed down. Calls that we expect to finish in 100 ms were taking 10+ seconds. Interestingly enough, it turned out to be a case where our server was killing our client (in the context of HTTP requests, of course).

This blog post follows our debugging adventure and findings from this incident. Read on if you don't wish to waste hours debugging anomalies in your HTTP requests.


Our setup and the debugging

The change that caused the latency involved adding a new Java web service - let's call it moriarty - and calling it from one of our other Python web services, say sherlock.

import json
import requests

def call_moriarty():
    response = requests.post(...)
    return json.loads(str(response.text))

We expect call_moriarty() to take less than 100 ms, but debugging showed it randomly taking 7+ seconds in some cases. This latency got magnified into minutes because multiple such requests were made serially.

To debug further, we added timing logs in moriarty at every step of the API call. Here, to our surprise, we found that everything finished in mere milliseconds!
With a heavy heart, we ditched the "moriarty is slow" hypothesis and decided to log timings in a more fine-grained, step-by-step way in the client service (sherlock) code.

In sherlock, the HTTP utility function that calls moriarty evidently did two things:
1) make the API call
2) load the result into a dictionary and return it

Ditching the pythonicity, we split the code into multiple lines and timed each step to see which one was causing the latency.

from timeit import default_timer as timer

def call_moriarty():

    #### #### step 1 starts #### ####
    start_time_1 = timer()
    response = requests.post(...)
    end_time_1 = timer()
    #### #### step 1 ends #### #### 
    
    logger.debug('time taken for step-1: {}', end_time_1 - start_time_1)

    #### #### step 2 starts #### ####
    start_time_2 = timer()
    data = json.loads(str(response.text))
    end_time_2 = timer()
    #### #### step 2 ends #### ####

    logger.debug('time taken for step-2: {}', end_time_2 - start_time_2)

    return data

time taken for step-1: 0.013552
time taken for step-2: 9.287207

To our surprise, step 2 was the cause of the delay we were seeing. It is always the one you least expect!
After this mildly surprising revelation, we started doubting whether this could be a performance kink in built-in libraries like json, or in de-facto standard libraries like requests, that we had been trusting blindly so far.

We began with json. We googled around and found reports similar to our case, where json.loads() was taking unexpectedly long for large strings. But on trying other JSON libraries like simplejson and yajl, we did not see any improvement; all of them gave similar timings for JSON deserialization.

Our next suspect was str(), which was only there for some extra type safety anyway. To test this, we stored str(response.text) in a separate variable and timed that step on its own, only to find that this simple assignment was taking all the time; every other step was under 1 ms. Removing str() made no difference to the timings either.
This meant that the HTTP call (and hence moriarty) was not the cause of the latency. Instead, reading the response content and storing it in a variable was, strangely, taking all the time - 10 seconds in some cases.
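For clarity, here is a simplified, self-contained sketch of that further split. The URL and function name are made up for illustration; only the structure mirrors what we actually ran:

from timeit import default_timer as timer
import json
import logging
import requests

logger = logging.getLogger(__name__)

def call_moriarty_split_step_2():
    # hypothetical URL; the real call posts to one of moriarty's APIs
    response = requests.post('https://moriarty.internal/api/export')

    start = timer()
    raw_text = response.text      # reading/decoding the response body - the slow, multi-second step
    mid = timer()
    data = json.loads(raw_text)   # the actual JSON deserialisation - well under 1 ms
    end = timer()

    # %s-style formatting here because this sketch uses the stdlib logger directly
    logger.debug('time taken to read response.text: %s', mid - start)
    logger.debug('time taken by json.loads: %s', end - mid)
    return data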

The issue

Eager to end this rather long chain of debugging surprises, we googled around this behavior of the Python requests library and stumbled upon the exact issue affecting us: https://github.com/psf/requests/issues/2359. The gist of the issue is as follows.

The Python requests library can take an awfully long time to detect the content encoding when the response is large and no encoding is specified in the response headers. This, in turn, is due to an issue in another library - chardet - which requests depends on to detect the response encoding.
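In other words, when no charset comes back in the Content-Type header, requests has no declared encoding to work with and falls back to detection. Roughly, the behavior (on the requests versions we were running, which bundle chardet) looks like this - the URL below is just a placeholder:

import requests

response = requests.get('https://moriarty.internal/api/expenses')  # placeholder endpoint

# requests derives response.encoding from the charset in the Content-Type header;
# for 'application/json' with no charset it stays None
print(response.encoding)           # -> None

# with encoding unset, accessing response.text falls back to
# response.apparent_encoding, which runs chardet over the entire body -
# the slow step for large responses
print(response.apparent_encoding)  # -> e.g. 'utf-8', as guessed by chardet
text = response.text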

We just had to enable debug logs to see this chardet behavior in action in our case.
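Surfacing these logs needs nothing more than DEBUG-level logging through the standard logging module (a minimal sketch; our services use their own logging configuration):

import logging

# chardet and urllib3 both log via the standard logging module, so raising the
# root logger to DEBUG is enough to watch the charset probers at work
logging.basicConfig(
    format='%(asctime)s %(name)s: %(message)s',
    level=logging.DEBUG,
)

With that in place, the chardet probers showed up clearly: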

2019-12-30 04:29:09,520 urllib3.connectionpool: https://app.fyle.in:443 "GET /api/tpa/v1/expenses HTTP/1.1" 200 None
2019-12-30 04:29:11,591 chardet.charsetprober: EUC-JP Japanese prober hit error at byte 113939
2019-12-30 04:29:12,462 chardet.charsetprober: EUC-KR Korean prober hit error at byte 113939
2019-12-30 04:29:12,825 chardet.charsetprober: CP949 Korean prober hit error at byte 113939
2019-12-30 04:29:13,267 chardet.charsetprober: Big5 Chinese prober hit error at byte 113940
2019-12-30 04:29:13,665 chardet.charsetprober: EUC-TW Taiwan prober hit error at byte 113939
2019-12-30 04:29:14,756 chardet.charsetprober: utf-8  confidence = 0.505
2019-12-30 04:29:14,758 chardet.charsetprober: SHIFT_JIS Japanese confidence = 0.01
2019-12-30 04:29:14,762 chardet.charsetprober: EUC-JP not active
2019-12-30 04:29:14,763 chardet.charsetprober: GB2312 Chinese confidence = 0.01
2019-12-30 04:29:14,767 chardet.charsetprober: EUC-KR not active
2019-12-30 04:29:14,772 chardet.charsetprober: CP949 not active
2019-12-30 04:29:14,774 chardet.charsetprober: Big5 not active
2019-12-30 04:29:14,775 chardet.charsetprober: EUC-TW not active
2019-12-30 04:29:14,777 chardet.charsetprober: windows-1251 Russian confidence = 0.01
2019-12-30 04:29:14,778 chardet.charsetprober: KOI8-R Russian confidence = 0.01
2019-12-30 04:29:14,781 chardet.charsetprober: ISO-8859-5 Russian confidence = 0.01
2019-12-30 04:29:14,785 chardet.charsetprober: MacCyrillic Russian confidence = 0.0
2019-12-30 04:29:14,788 chardet.charsetprober: IBM866 Russian confidence = 0.0
2019-12-30 04:29:14,789 chardet.charsetprober: IBM855 Russian confidence = 0.01
2019-12-30 04:29:14,791 chardet.charsetprober: ISO-8859-7 Greek confidence = 0.01
2019-12-30 04:29:14,793 chardet.charsetprober: windows-1253 Greek confidence = 0.01
2019-12-30 04:29:14,794 chardet.charsetprober: ISO-8859-5 Bulgairan confidence = 0.01
2019-12-30 04:29:14,796 chardet.charsetprober: windows-1251 Bulgarian confidence = 0.01
2019-12-30 04:29:14,798 chardet.charsetprober: TIS-620 Thai confidence = 0.01
2019-12-30 04:29:14,801 chardet.charsetprober: ISO-8859-9 Turkish confidence = 0.5528345669959578
2019-12-30 04:29:14,803 chardet.charsetprober: windows-1255 Hebrew confidence = 0.0
2019-12-30 04:29:14,805 chardet.charsetprober: windows-1255 Hebrew confidence = 0.01
2019-12-30 04:29:14,807 chardet.charsetprober: windows-1255 Hebrew confidence = 0.01
2019-12-30 04:29:14,815 urllib3.connectionpool: Starting new HTTPS connection (1): app.fyle.in:443

Just to be super-sure that this issue was the cause of the latency, we patched it on the client side by manually assigning an encoding to the response before accessing response.text, and ran our timing tests again.

    ... 
    # hard-coding response encoding before step-2, for testing
    response.encoding = 'utf-8'

    #### #### step 2 starts #### ####
    start_time_2 = timer()
    data = json.loads(str(response.text))
    end_time_2 = timer()
    #### #### step 2 ends #### ####

    logger.debug('time taken for step-2: {}', end_time_2 - start_time_2)
    return data

time taken for step-1: 0.011861
time taken for step-2: 0.052383

And as expected (or just prayed for, at this point), everything was blazing fast again: step 2 dropped from 9+ seconds to about 50 ms, with the assignment itself now taking less than 10 ms.

The fix

So we had finally isolated the issue, but assigning the encoding on the client would not be the right way to fix it. The server/service should take care of declaring the character set in its response headers.

Content-Type: application/json
Content-Type: application/json; charset=utf-8

We fixed the moriarty APIs by replacing the first header above with the second in their responses, and Python requests was happy again - the latency was gone!
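moriarty is a Java service, so the actual fix happened on the Java side. For completeness, here is what the equivalent change might look like in a Python service using Flask - purely an illustrative sketch, not moriarty's code:

from flask import Flask, Response
import json

app = Flask(__name__)

@app.route('/api/expenses')
def expenses():
    payload = json.dumps({'data': []})  # stand-in payload
    # declare the charset explicitly so clients like requests never have to
    # guess the encoding via chardet
    return Response(payload, content_type='application/json; charset=utf-8')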

Now that we were wiser, we made sure our other services followed suit. We audited the code of all our services looking for similar latencies and ended up finding cases where this bug was causing latencies of up to 15 seconds in production.
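A small script that flags endpoints whose Content-Type lacks a charset can make such an audit easier. A minimal sketch (the endpoint list here is hypothetical):

import requests

# hypothetical list of internal endpoints to check
ENDPOINTS = [
    'https://sherlock.internal/api/reports',
    'https://moriarty.internal/api/expenses',
]

for url in ENDPOINTS:
    resp = requests.get(url)
    content_type = resp.headers.get('Content-Type', '')
    if 'charset' not in content_type.lower():
        # without a charset, requests falls back to chardet detection on response.text
        print(f'{url}: Content-Type is {content_type!r} - missing charset')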

Some reasons why this issue - the cause of a significant performance regression - surfaced so late in our case:

  • Dependence on response size.
    For APIs where the response is not huge, the latency is not substantial and hence goes unnoticed.
  • Arbitrary nature / dependence on response content.
    Even for larger responses - in fact, for most such cases - we did not see the latency consistently; only about 1 in every 4 responses took long to deserialize. This apparently had more to do with the response content (special characters, unconventional encodings, etc.) than with size - a bigger response just aggravates the issue (see the toy benchmark after this list).
  • Frequency and timing of calls.
    This issue would not have been so prominent had the latency not been magnified by multiple consecutive API calls.
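We did not profile chardet in isolation while firefighting, but a toy benchmark along these lines can be used to observe how much detection time depends on the bytes being probed (both payloads below are synthetic, not our production data):

import time

import chardet

payloads = {
    'ascii':  b'{"description": "simple ascii expense"}' * 50000,
    'salted': ('{"description": "caf\u00e9 \u20ac r\u00e9sum\u00e9"}' * 50000).encode('utf-8'),
}

for name, payload in payloads.items():
    start = time.perf_counter()
    result = chardet.detect(payload)   # the same detection requests falls back to
    elapsed = time.perf_counter() - start
    print(f'{name}: {result["encoding"]} '
          f'(confidence {result["confidence"]:.2f}) in {elapsed:.2f}s')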

Repeating for emphasis: this hideous issue, present in one of the most used Python libraries out there, had slowed our workloads down by roughly two orders of magnitude - calls expected to finish in 100 ms were taking 10+ seconds!

Nevertheless, it was a nice debugging adventure, with findings that will save us an awful lot of head-scratching in the future.
If you enjoy taking up such thrilling bug chases, we might have a job for you. Drop us a mail at careers@fyle.in! :)