Opened 5 years ago

Closed 3 years ago

#133 closed defect (fixed)

Problem in Petascope to serve simultaneous request

Reported by: barboni@… Owned by: nkolev
Priority: major Milestone: 9.0.x
Component: petascope Version: 8.3
Keywords: concurrence scalability Cc: abeccati, pbaumann, pcampalani, damiano, mantovani, sebrecht
Complexity: Medium

Description

Ticket raised after a discussion on earthserver-services mailing list

http://groups.google.com/group/earthserver-services/browse_thread/thread/30e9011406db4853/

I have only one client that perform a series of parallel request to
Petascope.

Rasdaman support parallel read access and the problem seems related to
Petascope that is able to serve only one request at a time.
The same WCS requests, converted in the equivalent psql query and
submitted simultaneously to rasdaman, works fine.

In the log file that I have posted on the mailing list there are a lot messages like
"Preparing request for new connect by resetting old connection"

If I post a series of simultaneous requests, only the last request
received by petascope are served and it seems that the others are
cancelled.

I tried to use the Petascope installed at http://www.earthlook.org/ and
it can serve up to 10 simultaneous requests, mine only one.

Damiano


OS info:
lsb_release -a
LSB
Version: :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: Scientific
Description: Scientific Linux release 6.0 (Carbon)
Release: 6.0
Codename: Carbon

gcc --version
gcc (GCC) 4.4.4 20100726 (Red Hat 4.4.4-13)
Copyright (C) 2010 Free Software Foundation, Inc.

java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.6) (rhel-1.43.1.10.6.el6_2-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

tomcat6 version
Server version: Apache Tomcat/6.0.24
Server built: December 5 2011 1239
Server number: 6.0.24.0
OS Name: Linux
OS Version: 2.6.32-220.4.1.el6.x86_64
Architecture: amd64
JVM Version: 1.6.0_22-b22
JVM Vendor: Sun Microsystems Inc.

Attachments (7)

test_parallel_requests.py (5.5 KB) - added by barboni@… 5 years ago.
rasjQuery.java (1.1 KB) - added by erodriguez 5 years ago.
concurrentRasj.java (506 bytes) - added by erodriguez 5 years ago.
catalina.out (38.7 KB) - added by barboni@… 5 years ago.
5 request (4 OK and 1 KO)
test_parallel_requests.2.py (10.7 KB) - added by barboni@… 5 years ago.
new version with wcps support
TestParallelRequests.py (4.5 KB) - added by barboni@… 4 years ago.
New test script with python unittest library
0001-Fix-handling-of-simultaneous-requests-in-petascope-t.patch (9.3 KB) - added by dmisev 4 years ago.

Download all attachments as: .zip

Change History (50)

Changed 5 years ago by barboni@…

comment:1 Changed 5 years ago by anonymous

The script test_parallel_requests.py perform a series of simultaneous requests to petascope and save the receive xml files.

#Usage: ./test_parallel_request.py APP N
# APP can be "r" for RASDAMAN or "p" for PETASCOPE
# N is the number of requests generated

At the beginning of the file configure your host name, port, output file path, collection name ecc

Comment or change rows 124 and 125 to manage subset.

obj.setSubset("x", 0,200)
obj.setSubset("y", 0,200)

comment:2 Changed 5 years ago by dmisev

  • Owner changed from dmisev to erodriguez
  • Status changed from new to assigned

Changed 5 years ago by erodriguez

Changed 5 years ago by erodriguez

comment:3 Changed 5 years ago by erodriguez

The bug resides in the rasj component of Rasdaman since it can't open multiple connections to the database. This is proven by the attached concurrentRasj.java program which tries to do simultaneous queries to a rasdaman server.

to compile:

javac -cp rasj.jar concurrentRasj.java rasjQuery.java

to run:

java -cp rasj.jar:./ concurrentRasj numThreads

If numThreads is set to one (one simultaneous query) the program runs without errors. But if numThreads is set to a number bigger than 1 (even 2) only one connection is opened and the rest of the connections fail to open, ie. the call to rasj.rnp.RasImplementation?.openDB() fails and throws an exception.

comment:4 Changed 5 years ago by anonymous

That's not good, did you discover why and how it could be fixed? I guess there's some static connection object.

comment:5 Changed 5 years ago by erodriguez

Not yet, I'm working on that. But I believe it's a little more complex since even running simultaneous queries as separate processes fails:

java -cp rasj.jar:./ concurrentRasj & java -cp rasj.jar:./ concurrentRasj numThreads

(ie. one simultaneous query per process) and in that case static variables shouldn't matter since each process only has one thread. I will continue working and see what I find.

comment:6 Changed 5 years ago by erodriguez

Correction of the command above:

java -cp rasj.jar:./ concurrentRasj 1 & java -cp rasj.jar:./ concurrentRasj 1

comment:7 Changed 5 years ago by dmisev

Quite strange.. I thought it'd be a problem in petascope, as it seems RasUtil?.executeRasqlQuery is a static method in petascope.

comment:8 Changed 5 years ago by erodriguez

Indeed, that was my initial guess as well, but making it non static didn't solve the problem, so I tracked it further and found out I couldn't open multiple connections with rasj. I tried many approaches (using different processes) using only one instance of implementation, using multiple instances of implementation, all resulted in the same error.

comment:9 Changed 5 years ago by erodriguez

Well, you guys you seem to know your code, the fix is indeed simple. The openDB method is calling the getFreeServer whith false as argument. This results in rasj allways using the same server (ie. the one running on port 7002). So concurrent queries fail since all use the same server. I will soon submit the patch. Calling getFreeServer with true as argument seems to fix the problem.

comment:10 Changed 5 years ago by barboni@…

Thanks rodriguez for your work

Today I found time to try your fix.
It seems that the problem related to the parallel request is partially solved but the system is unstable.

For example if I perform 5 simultaneous request to petascope i don't always get the same response. Some requests return an xml file with the error:

(1) "Error converting WCPS query to rasql query"

It's strange because the rasdaman logs don't show any error for the request failed and the error isn't systematic. Sometime all 5 requests work but often not.


I attached the catalina.log file for the case:
5 parallel request;

  • 4 response OK
  • 1 response with error (1)

Also considers that the probability of reproducing the error (1) increase with the number of requests. With 10 rasdaman server running and 15 simultaneous request i obtain very few correct results (one or two) and rarely the error is

(2) RasManager? Error: System overloaded, please try again later.


In conclusion it seems that the rasj.jar component is duplicated in different folder so the bugfix need to be propagated

The "~/rasdaman/java/dist/rasj.jar" accept parallel request but the following rasj.jar not:

~/rasdaman/applications/petascope/build/web/WEB-INF/lib/rasj.jar
md5sum = 5cb532a97781d10818cdc2600a475fa2

~/rasdaman/applications/petascope/lib/rasj.jar
md5sum = 5cb532a97781d10818cdc2600a475fa2

To compile petascope i used the rasj.jar at "~/rasdaman/java/dist/rasj.jar"


Feel free to ask me for additional info or log messages

Ciao

Damiano

Changed 5 years ago by barboni@…

5 request (4 OK and 1 KO)

comment:11 Changed 5 years ago by erodriguez

Thanks for the kind feedback, I'm currently working on the second problem (server overloaded) since the fix should be easy and also on unifying the compilation process of Rasj and Petascope, both problems should be solved soon. I will then start testing the first problem, since there is some uncertainty at the moment regarding where the problem resides.

Regards,

Ernesto Rodriguez

comment:12 Changed 5 years ago by pcampalani

5 requests (3 OK and 2 KO).

A little patch on test_parallel_requests.py:

--- 97     xml = "ERROR " %(r_status)
+++ 97     xml = "ERROR %s" %(r_status)

--- 116    data_Out      =  commands.getstatusoutput(rasdaman_request)
+++ 116    data_Out      =  str(commands.getstatusoutput(rasdaman_request))

Piero

comment:13 Changed 5 years ago by barboni@…

I updated the script "test_parallel_requests.py" to execute also wcps requests (abstract syntax)

Change at the beginning of the file the global variables hostname, port, coverage_id ecc with yours

Ciao

Damiano

USAGE:
python test_parallel_requests.py APP N

APP can be:

"ras" for RASDAMAN,
"wcs" for PETASCOPE wcs2.0,
"wcps" for PETASCOPE wcps abstract syntax

N is the number of requests generated

Changed 5 years ago by barboni@…

new version with wcps support

comment:14 Changed 5 years ago by dmisev

So what's the status, Ernesto? Is this ticket fixed?

comment:15 Changed 5 years ago by erodriguez

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:16 Changed 4 years ago by abeccati

  • Keywords concurrence scalability added
  • Priority changed from major to blocker
  • Resolution fixed deleted
  • Status changed from closed to reopened

Ticket reopened upon notification by barboni@…

comment:17 Changed 4 years ago by dmisev

Any more details? :)

comment:18 Changed 4 years ago by abeccati

  • Cc a.beccati@… p.campalani@… added; beccati@… removed
  • Owner changed from erodriguez to dmisev
  • Status changed from reopened to assigned

Damiano should provide an updated script to run the tests.

First we should integrate the provided test into the system tests (jin ?) to ensure no accidental reintroduction, then fix it.

comment:19 Changed 4 years ago by dmisev

There is a test, source:systemtest/testcases_open/test_rasjConcurrency but it's failing as far as I've noticed.

Changed 4 years ago by barboni@…

New test script with python unittest library

comment:20 Changed 4 years ago by barboni@…

Hi Alan,

I loaded the test script also in this track. I saw that there are other scripts attached to this ticket, probably is better to use the last version (TestParallelRequests?.py), it's more essential and clean and make use of the python unittest library.

As mentioned in the ES track, i have an old petascope.war that works but I don't remember what was the related git commit. Tell me if you know a way to retrieve the version from a compiled petascope, so it will be possible to define a date range where the bag was reintroduced.

Damiano

comment:21 Changed 4 years ago by pcampalani

  • Owner changed from dmisev to pcampalani

comment:22 Changed 4 years ago by abeccati

  • Milestone set to 8.4

comment:23 Changed 4 years ago by pcampalani

Up to now I can tell that the last working commit (I tried) is:

commit 071e0a0065d8781e07820ca13b80926e7ca6a39c
Author: Alireza Rezaei Mahdiraji <a.rezaeim@jacobs-university.de>
Date:   Wed Aug 15 14:54:17 2012 +0200

    Using constants in WCPS (part4)

comment:24 Changed 4 years ago by dmisev

  • Cc arezaeim added

Perhaps some constant has been misplaced?

comment:25 Changed 4 years ago by pcampalani

That commit is working, so the introduction of the constant should not be the problem.

I just listed this temporary result, I'll continue later on discovering the watershed commit.

comment:26 Changed 4 years ago by pcampalani

  • Cc mrusu added

First non-working commit:

commit 42a68acc5a7c7a59c5c6ceb37ac825a3b7ba2f55
Author: Mihaela Rusu <m.rusu@jacobs-university.de>
Date:   Wed Oct 3 17:29:02 2012 +0200

    inserting metadata into wcs2 response

So, one problem seems to lie in the new getImageMetadata() method, which throws the SQLException ("Error retrieving metadata for coverage").

The other error message ("Metadata for coverage '%' is not valid") originates in the DbMetadataSource.read() method. It is called by WcsUtil.java which however hides the details so it is not clear which exception is thrown by the former: only coverageId is propagated, whereas the incoming exception message is ignored.

The method DbMetadataSource.getImageMetadata() in not called inside DbMetadataSource.read(), however no other modifications to DbMetadataSource.java where done in that patch, so I really guess the other error are anyway caused by the clash with the query for the metadata (ps_metadata).

I can have a look at this next week.

comment:27 Changed 4 years ago by abeccati

Very good, this bisect tool is really well thought ;)
Next week is ok and it would get straight into 8.4

comment:28 Changed 4 years ago by pcampalani

As a quick workaround, in case the gmlcov:metadata element is not needed in the WCS responses, one could comment out the body of the petascope.core.DbMetadataSource.getImageMetadata() method, forcing it to return an empty String "".

comment:29 Changed 4 years ago by dmisev

I can see the problem with the getImageMetadata, it's called and handled separately instead of being read into the GetCoverageMetadata?.. Not sure if reading it in the same way as all the other coverage info (in DbMetadataSource?.read(..)) will fix this problem, but we should anyway remove the getImageMetadata and move the code to the read() method.

comment:30 Changed 4 years ago by dmisev

This is what should be moved to the read() somewhere:

            ResultSet r = s.executeQuery("SELECT metadata FROM ps_metadata WHERE coverage in "
                     + "(SELECT id FROM ps_coverage WHERE name = '" + coverageId + "')");
            if (r.next()) {
                return r.getString("metadata");
            } else {
                return "";
            }

comment:31 Changed 4 years ago by dmisev

@Piero: you may be getting those errors because you haven't run update_petascopedb.sh on your database, i.e. you don't have the new metadata table? And perhaps that is also causing your multithreading test to fail, i.e. this commit may not be the problematic one for this particular case, especially because I don't see any code that could cause multithreading issues.

comment:32 Changed 4 years ago by pcampalani

Dimitar,
yes, my db is updated with ps_metadata (I don't think Petascope deploys otherwise).

Don't you get the errors with TestParallelRequests.py as well?

Anyway, the solution you proposed is probably what is meant to be to fix the ticket.

comment:33 Changed 4 years ago by dmisev

Seems to work for me, but all the ports are closed here and I can't pull and make the patch against the latest rasdaman, so attaching it here and Piero you can redo it on the latest code.

comment:34 Changed 4 years ago by dmisev

would be also nice to integrate the python test into the test_wcs systemtest, it seems to work really well (thanks Damiano!), or fix the test_rasjConcurrency.

comment:35 Changed 4 years ago by pcampalani

Tested your patch against on latest rasdaman: it works.

comment:36 Changed 4 years ago by dmisev

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:37 Changed 4 years ago by barboni@…

tested today
now petascope works

thanks and ciao

damiano

comment:38 Changed 4 years ago by dmisev

  • Cc abeccati pbaumann pcampalani damiano mantovani added; a.beccati@… p.campalani@… arezaeim mrusu removed
  • Complexity set to Medium
  • Milestone changed from 8.4 to 9.0
  • Priority changed from blocker to major
  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:39 Changed 4 years ago by dmisev

  • Owner changed from pcampalani to nkolev
  • Status changed from reopened to assigned

Reopening as it seems the handling of simultaneous connections is quite unreliable in petascope, especially when there are more connections than there are rasservers available.

The test script to be used for testing is attachment:TestParallelRequests.py e.g. try with 10 rasservers and N_REQUEST = 30

comment:40 Changed 4 years ago by dmisev

  • Cc sebrecht added

Furthermore rasserver dies (i.e. becomes unavailable), when there are too many connections exceeding the max_connections values in postgresql.conf. In the rasdaman log I see SQL error codes 402 and 220, they should be handled better instead of rasserver becoming unavailable.

Same can happen with petascope as well, depending on which one exceeds the threshold first I guess.

Last edited 4 years ago by dmisev (previous) (diff)

comment:41 Changed 3 years ago by pcampalani

UPDATE (see also #795)

Thread-safety seems damaged from v9.0.1 to v9.0.2.
Using attachment:TestParallelRequests.py (thanks damiano for the update for v9):

> grep ^WCS_request TestParallelRequests.py 
WCS_request    = "/rasdaman/ows/wcs2?service=WCS&Request=GetCoverage&version=2.0.1&coverageid=mr&subset=i(0,10)&subset=j(0,1)"
> $ grep ^N_REQUEST TestParallelRequests.py 
N_REQUEST      = 50
> python TestParallelRequests.py
> egrep "404|500" . -r | wc -l

At MEEO Srl v9.0.1 is stable.
Personally on my laptop v9.0.1 shoots a lot of 'RasdamanUnavailable: Unable to get a free rasdaman server', while v9.0.2 only fails < 10%.

comment:42 Changed 3 years ago by dmisev

Piero can you try to isolate the commit between 9.0.1 and 9.0.2 that causes this? But first I guess we have to reproduce the issue.

comment:43 Changed 3 years ago by dmisev

  • Resolution set to fixed
  • Status changed from assigned to closed

I guess this is fixed.

Note: See TracTickets for help on using tickets.