
Reliability issue with Web interface using OECGI talking to OI 9.4 OESocketServer.jar (via Service)


Has anyone experienced issues recently with the above setup?

Basically, we have a couple of clients who have been running the above setup in production without issue for quite a while. All of a sudden, over the last few weeks or so, communication has been hanging regularly. A simple restart of the wrapper service for OESocketServer restores access immediately. I should note that the wrapper service is still listed as running, not stopped.

The server setup looks something like this:
  • Windows Server 2012 R2 with IIS 8.5 (their IT service company appears to be on the ball when it comes to patch maintenance). We also have another client on Server 2016 that is experiencing some recent flakiness, which may or may not ultimately be the same issue.
  • oecgi3 (a renamed OECGI 4.0.0.1). I have just changed this to OECGI 4.0.0.3, with no apparent improvement in stability. (We need the oecgi3 name because it is hardcoded in LOTS of places in some deprecated legacy code.)
  • OI 9.4 (I didn't run a full update of OI to 9.4.4, but I did try the latest oesocketserver.jar from patch 5.1).
  • Was running Oracle JRE 8.211. Upgraded to 8.241, then replaced it with AdoptOpenJDK. Neither change made it more stable.
  • Java Service Wrapper: nothing obvious in the logs (nothing that hasn't been there for the past four years). A typical startup looks like this:
    STATUS | wrapper | 2020/03/09 11:14:10 | --> Wrapper Started as Service
    STATUS | wrapper | 2020/03/09 11:14:11 | Launching a JVM...
    INFO | jvm 1 | 2020/03/09 11:14:11 | Wrapper (Version 3.1.2) http://wrapper.tanukisoftware.org
    INFO | jvm 1 | 2020/03/09 11:14:11 |
    INFO | jvm 1 | 2020/03/09 11:14:11 | WARNING - System.in can not be used when the JVM is being controlled by the Java Service Wrapper. Calls will block indefinitely.
    INFO | jvm 1 | 2020/03/09 11:14:11 | Version: 3.0.0.411 - Licensed for use to CN=Revelation Software
    INFO | jvm 1 | 2020/03/09 11:14:11 | Started at 2020-03-09 11:14:11
Again, I should note that it worked fine on the original configuration for ages and then just became flaky.
I know these are pretty broad brushstrokes at the moment, but I am just putting it out there in case a lightbulb goes off for anyone who has experienced a similar issue…
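One way to capture the "service still listed as running but not answering" symptom with timestamps, so the next failure leaves some evidence before the restart, is a small external probe. This is a minimal Python sketch, not Revelation tooling: it assumes the trace endpoint is reachable at `http://localhost/cgi-bin/oecgi3.exe/inet_trace` (built from the SCRIPT_NAME and PATH_INFO values in the INET_TRACE output in this thread), and the host, timeout, interval and function names are all hypothetical placeholders. It separates a request that gets no HTTP reply at all (timeout, refused or reset connection) from one that gets an HTTP error page, which is the distinction that tells you how far a request got.

```python
import http.client
import time
import urllib.error
import urllib.request

# Hypothetical URL: SCRIPT_NAME + PATH_INFO from the INET_TRACE output;
# adjust host/port for your own server.
PROBE_URL = "http://localhost/cgi-bin/oecgi3.exe/inet_trace"


def classify(status):
    """Map an HTTP status (or None if nothing came back) to a coarse category."""
    if status is None:
        return "no_response"  # timeout, refused or reset: no HTTP reply at all
    if 200 <= status < 300:
        return "ok"
    return "http_error"  # the server answered, e.g. with an IIS error page


def probe(url=PROBE_URL, timeout=15.0):
    """Send one GET; return (category, status, elapsed_seconds, detail)."""
    start = time.monotonic()
    status, detail = None, ""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as exc:
        # Must come first: HTTPError is a subclass of URLError (and OSError).
        status, detail = exc.code, str(exc.reason)
    except (OSError, http.client.HTTPException) as exc:
        # URLError, socket timeouts and connection resets derive from OSError;
        # HTTPException covers malformed or truncated replies.
        detail = repr(exc)
    return classify(status), status, time.monotonic() - start, detail


def monitor(url=PROBE_URL, interval=60.0, count=None):
    """Probe every `interval` seconds, printing one timestamped line per probe.

    Runs forever when count is None; otherwise stops after `count` probes.
    """
    done = 0
    while count is None or done < count:
        category, status, elapsed, detail = probe(url)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {category} status={status} {elapsed:.1f}s {detail}", flush=True)
        done += 1
        if count is None or done < count:
            time.sleep(interval)
```

Leaving `monitor(interval=30)` running in a console on the web server would record when probes switch from `ok` to `no_response` or `http_error`, which can then be lined up against the wrapper and IIS logs.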

Comments

  • We have not seen any problems like you describe. However, I want to get clarity on what you are describing as a "communication hang". Are you saying that the HTTP request is made but no response is received? Have you been able to confirm whether or not the request reaches OpenEngine (i.e., calls your BASIC+ code)? If there is a new problem, it's less likely to be an OECGI issue and more likely to be an OEngineServer issue. It will be a lot easier to diagnose if we can confirm how far the incoming request reaches.
  • Sorry for the loose terminology, Don. Chrome DevTools showed the site making the request but never getting a response. A request to OECGI/inet_trace also failed to return successfully until the service was restarted, after which everything was fine for a while. I can't remember the exact error returned while it was down, but it was not a success, and it was an IIS error page rather than an error returned through the CGI (such as a 401 Unauthorized).

    My feeling is that it is an OEngineServer issue rather than an OECGI one, but I am certainly not across the ins and outs of how the whole structure communicates. I am effectively just restarting the DB 'listener' (for want of a better term) while doing nothing to the 'caller' (OECGI), so the problem appears to lie beyond OECGI.

    As this is happening on a production site, I am getting less and less time to gather information, since the client is getting more and more agitated by the downtime. In the early days I stole a few minutes to collect information while the issue was occurring. Now I need to restart that service pretty quickly to restore law and order!

    I probably should have posted here for ideas a week or two ago :(

  • This is the INET_TRACE output while the service is responsive (I have sanitised it somewhat to remove site-specific info):

    CONTENT_LENGTH = 0
    CONTENT_TYPE =
    GATEWAY_INTERFACE = CGI/1.1
    HTTPS = off
    HTTP_ACCEPT = text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
    HTTP_COOKIE =
    HTTP_FROM =
    HTTP_REFERER =
    HTTP_USER_AGENT = Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/82.0.4077.0 Safari/537.36
    PATH_INFO = /inet_trace
    PATH_TRANSLATED = C:\inetpub\wwwroot\$PATH\inet_trace
    QUERY_STRING =
    REMOTE_ADDR = xxx.xxx.xxx.xxx
    REMOTE_HOST = xxx.xxx.xxx.xxx
    REMOTE_IDENT =
    REMOTE_USER =
    REQUEST_METHOD = GET
    SCRIPT_NAME = /cgi-bin/oecgi3.exe
    SERVER_NAME = yyy.yyy.yyy.yyy
    SERVER_PORT = zzzz
    SERVER_PROTOCOL = HTTP/1.1
    SERVER_SOFTWARE = Microsoft-IIS/8.5
    SERVER_URL =
    SERVER_SERIAL = $SERIAL
    RegistryKey = SOFTWARE\RevSoft\OECGI3
    EngineName =
    ServerURL = localhost
    ServerPort = $PORT
    ApplicationName = $APP_NAME
    UserName = $USER_NAME
    StartupFlags = 65
    ShutdownFlags = 1
    FileMode = 1
    FilePath =
  • So this site is running RUN_OECGI_REQUEST rather than HTTP_MCP? This shouldn't matter, but I'm just confirming, because HTTP_MCP would give you internal logs to reference.

    The fact that INET_TRACE fails suggests to me that the engine isn't getting called. Have you inspected the IIS logs to see what they reveal?

    Are you running the OEngineServer in debug mode or as a service? If as a service, consider switching to debug mode so you can watch the console and the engines when the system is unresponsive. It will be telling to see whether either or both show any activity.
  • I think we tracked it down. Ultimately there were two issues. The first (the 'hung' system I mentioned) masked the second, which was the actual root cause. I think that once the root issue had occurred enough times, it eventually brought down the service, and that was all we saw by the time we became involved.

    Running the OEngineServer in debug mode once I had a live report of the root issue eventually led to the solution, as I was then able to trace through the entire process. It was not quite as direct as it should have been, due to the nature of the specific setup, but ultimately it turned out to be some 'hidden' corrupt indexes.

    As an aside, I did some investigation into the Debug Intercept mode you mentioned, Don. I will keep that in the back of my mind... aka 'potential toolbox'!
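For anyone chasing a similar hang, Don's suggestion to inspect the IIS logs can be scripted. This is a minimal Python sketch, assuming the site logs in IIS's W3C extended format with the `cs-uri-stem`, `sc-status`, `sc-win32-status` and `time-taken` fields enabled (logs typically live under `C:\inetpub\logs\LogFiles\W3SVC<site id>\`). It reads the field names from the `#Fields:` header rather than hard-coding their order, counts status codes for requests to `oecgi3.exe`, and lists failed or slow ones. `time-taken` is in milliseconds; the 30-second threshold and the function names are arbitrary, hypothetical choices.

```python
from collections import Counter


def parse_w3c(lines):
    """Yield one dict per request, keyed by the names in the latest #Fields: line."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            # W3C logs are space-separated; IIS writes spaces inside values as '+'.
            yield dict(zip(fields, line.split()))


def summarise(records, script="oecgi3.exe", slow_ms=30000):
    """Count statuses for requests to `script` and collect failed or slow ones.

    Returns (Counter of sc-status, list of (date, time, status, win32 status, ms)).
    """
    statuses = Counter()
    suspect = []
    for rec in records:
        if script.lower() not in rec.get("cs-uri-stem", "").lower():
            continue  # ignore static files and other handlers
        status = rec.get("sc-status", "-")
        statuses[status] += 1
        raw = rec.get("time-taken", "-")
        taken = int(raw) if raw.isdigit() else 0  # milliseconds; '-' if not logged
        if not status.startswith("2") or taken >= slow_ms:
            suspect.append((rec.get("date"), rec.get("time"), status,
                            rec.get("sc-win32-status"), taken))
    return statuses, suspect
```

For example, opening a day's log (a file name such as `u_ex200309.log` is illustrative) with `encoding="utf-8", errors="replace"` and passing the file object to `summarise(parse_w3c(fh))` gives the status counts plus a list of suspect requests whose timestamps can be compared with what the engine console shows in debug mode.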