Activity for collectl

  • Bill Torpey Bill Torpey posted a comment on discussion Help

    Thanks Laurence, Mark for the quick replies, and my apologies for not getting back sooner. In the meantime, I've been combing through the docs for collectl, /proc, etc. to see if there's another way to get the data I'm looking for, or if I've misunderstood something along the way. To backtrack a bit, my initial question about shared memory was because I wanted to be certain that shared memory didn't get included in some of the other metrics that collectl collects from /proc. We really wanted to monitor...

  • Laurence Oberman Laurence Oberman posted a comment on discussion Help

    I don't think it's generic memory details Bill is after, but rather the per-process shared memory specifics. collectl does not capture data that granular. I would have to look at adding the /proc/PID/smaps data, etc. Regards, Laurence
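
    As a rough illustration of the /proc/PID/smaps idea mentioned above, the sketch below sums the Shared_Clean and Shared_Dirty fields that the Linux kernel reports per mapping. This is not collectl code; the function names and the sample text are hypothetical, and the Shared_* counters cover any pages mapped by more than one process, which may not line up exactly with a database's shared memory segments.

```python
import re

# Assumption: these are the field names the Linux kernel writes to
# /proc/<pid>/smaps; collectl itself does not read this file.
SHARED_FIELDS = ("Shared_Clean", "Shared_Dirty")


def shared_kb_from_smaps(smaps_text: str) -> int:
    """Sum the Shared_Clean and Shared_Dirty values (in kB) over all mappings."""
    total = 0
    for line in smaps_text.splitlines():
        m = re.match(r"(\w+):\s+(\d+) kB", line)
        if m and m.group(1) in SHARED_FIELDS:
            total += int(m.group(2))
    return total


def shared_kb_for_pid(pid: int) -> int:
    """Read /proc/<pid>/smaps (Linux only; needs permission to read the file)."""
    with open(f"/proc/{pid}/smaps") as f:
        return shared_kb_from_smaps(f.read())


# Hypothetical sample of a single mapping, for illustration only.
SAMPLE = """\
7f0000000000-7f0000021000 rw-s 00000000 00:05 12345 /SYSV00000000 (deleted)
Size:                132 kB
Rss:                 100 kB
Shared_Clean:         40 kB
Shared_Dirty:         60 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
"""

print(shared_kb_from_smaps(SAMPLE))
```

    Subtracting such a figure from a process's RSS would be one way to approximate "memory used excluding shared", under the caveats above.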

  • Mark Seger Mark Seger posted a comment on discussion Help

    Amazing how much one forgets after being retired for a few years, but... There are actually a LOT of per-process memory stats gathered, which you can display. If you run collectl --showsubopts you see values for --procopts: c - include cpu time of children who have exited (same as ps S switch); f - use cumulative totals for page faults in proc data instead of rates; i - show io counters in display; I - disable collection/display of I/O stats, saves over 25% in data collection overhead; k - remove...

  • Laurence Oberman Laurence Oberman posted a comment on discussion Help

    Hello, Are you asking about the process-related memory view via -sZ? It's not available there. collectl captures all of /proc/meminfo, but that won't help with shared memory per process; it will just show system-wide shared memory. collectl can't really help here. What you need is the ipcs command to understand what shared memory is reserved. The best you can do is ipcs -c -m -p: [root@loberhel ~]# ipcs -c -m -p ------ Shared Memory Creator/Last-op PIDs -------- shmid owner cpid lpid 94175235 loberman 10430 2489 89522180...

  • Bill Torpey Bill Torpey posted a comment on discussion Help

    I'm trying to figure out a way to get processes' shared memory usage from collectl, and I'm not having much luck. Our application connects directly to a database which uses shared memory, and I'd like to exclude this memory from the amount used, but I can't find it anywhere. I see some comments in the code (e.g., "added nr_shmem to memory -P stats") but don't know where to go from there. Any help would be most appreciated!

  • James R. Waters James R. Waters posted a comment on discussion Help

    Like I said above I installed collectl to get more info of resource usage. My server boots and tries to throw up a splash screen of some kind (wasn't happening before installing collectl) then my monitor loses signal. I have no idea what's going on and Google hasn't been any help. Let me know if you need any more info, any help would be appreciated.

  • collectl collectl updated /collectl/collectl-4.3.2/collectl-4.3.2.src.tar.gz

  • collectl collectl released /collectl/collectl-4.3.2/collectl-4.3.2.src.tar.gz

  • collectl collectl released /collectl/collectl-4.3.2.src.tar.gz

  • Bill Torpey Bill Torpey posted a comment on discussion Open Discussion

    Brilliant! Thanks so much! I had no idea that --procopts w took a parameter, but that does the trick. P.S. As for the unzipping the raw.gz file -- that worked a treat also. Thanks again to you and Mark for a terrific piece of software!

  • Laurence Oberman Laurence Oberman posted a comment on discussion Open Discussion

    I will look into this Regards Laurence

  • Laurence Oberman Laurence Oberman posted a comment on discussion Open Discussion

    Hello I tested this without --procopts -w2000 I get the truncated cmdline 17:51:00 9314 loberman 20 8398 0 S 110M 1M 4 0.00 0.00 0 00:00.00 0 0 0 3 /bin/bash ./coltest.sh 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200123...

  • Laurence Oberman Laurence Oberman posted a comment on discussion Open Discussion

    Subscribed. Hopefully I will figure out why I haven't been getting the messages lately.

  • Laurence Oberman Laurence Oberman posted a comment on discussion Open Discussion

    Hello, If you want to look at a raw.gz that was not properly closed, you will need to do this: cat xxx.raw.gz | gunzip > xxx.raw Then you can check the argument length. I looked at the code: the limit is not in the gathering but in the playback, where it's set to 1000 if no other options are given for --procopts. collectl: $procCmdWidth=($procOpts=~s/w(\d+)/w/) ? $1 : 1000; w - widen display by including whole argument string, with optional max width *** I have not tested this but if you include --procopts...
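
    To make the quoted Perl line easier to follow, here is a small Python sketch of the same logic as I read it: if the --procopts string contains w followed by digits, the digits become the playback width and are stripped from the option string; otherwise the width falls back to 1000. The function name is hypothetical and this is only an illustration, not collectl's code.

```python
import re

DEFAULT_WIDTH = 1000  # playback default quoted in the thread


def parse_proc_width(proc_opts: str):
    # Mirrors: $procCmdWidth=($procOpts=~s/w(\d+)/w/) ? $1 : 1000;
    # Only the first "w<digits>" is rewritten, as with Perl's s/// without /g.
    m = re.search(r"w(\d+)", proc_opts)
    if m is None:
        return proc_opts, DEFAULT_WIDTH
    stripped = proc_opts[:m.start()] + "w" + proc_opts[m.end():]
    return stripped, int(m.group(1))


for opts in ("w", "w2000", "fw500"):
    print(opts, parse_proc_width(opts))
```

    Under this reading, --procopts w2000 would raise the playback limit to 2000 characters while a bare w keeps the default.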

  • Bill Torpey Bill Torpey posted a comment on discussion Open Discussion

    I don't know what collectl is storing internally -- I've tried unzipping the .raw.gz file, but I get "unexpected end of file".

  • Laurence Oberman Laurence Oberman posted a comment on discussion Open Discussion

    Hello, I am not sure why I never saw this on the list. As far as I know I am subscribed. I will look into this, i.e. is the truncation within the actual collectl data gathering or only on playback? If the limit is within collectl gathering and not playback, then increasing it would in practice mean growing the on-disk data capture size, so I will have to think about that, because if we have many thousands of tasks it's a lot of impact on storage and capture. Regards Laurence On Mon, Jan 10, 2022 at 4:04 PM Mark...

  • Bill Torpey Bill Torpey posted a comment on discussion Open Discussion

    Thanks. This has nothing to do with java per se -- it's just that our java processes have really long command lines -- when we get them from collectl they are truncated after approx 1K characters. The question is whether collectl stores the full command line, and if so, whether there is any way to get the full command line from collectl.

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    The short answer is I have no idea AND I know nothing about java. Might it be possible to run a java script that in turn runs a script that runs collectl? I've also been retired a few years and support of collectl has been taken on by redhat. Perhaps laurence can point you in the right direction. -mark On Mon, Jan 10, 2022 at 10:49 AM Bill Torpey wallstprog@users.sourceforge.net wrote: When extracting data from a raw.gz file (e.g., with "--procopts w -sZ -oD") it appears that the command string is...

  • Bill Torpey Bill Torpey posted a comment on discussion Open Discussion

    When extracting data from a raw.gz file (e.g., with "--procopts w -sZ -oD") it appears that the command string is truncated after approx. 1K characters. In our case, we launch java processes with very long --classpath parameters, so we lose the last part of the command line, which for us is the most significant. Is there any way to get collectl to output the full command line? P.S. I do understand that we can avoid this problem by using e.g., CLASSPATH environment variable -- that would require changes...

  • nqq nqq posted a comment on discussion Open Discussion

    I've found the reason. Here it is https://sourceforge.net/p/collectl/discussion/696864/thread/14f35468/

  • nqq nqq posted a comment on discussion Open Discussion

    Hello. When I used collectl to collect net stats on centOS KVM, I found every stat is 0 : [root@localhost HybridScheduling]# collectl -s-cdn+N -i 1 waiting for 1 second sample... Bogus data record skipped for NET:eth0: data on 20211214 at 11:26:58 # NETWORK STATISTICS (/sec) #Num Name KBIn PktIn SizeIn MultI CmpI ErrsI KBOut PktOut SizeO CmpO ErrsO 0 lo 0 0 0 0 0 0 0 0 0 0 0 1 eth0 0 0 0 0 0 0 0 0 0 0 0 2 docker0 0 0 0 0 0 0 0 0 0 0 0 Bogus data record skipped for NET:eth0: data on 20211214 at 11:26:59...

  • Laurence Oberman Laurence Oberman posted a comment on discussion Open Discussion

    I committed the last patch and will revert it if it still has issues, but I can't see why this would not work, given the test script I wrote. commit 19e977b7a8fa2c93575b5f9247d9a1250f031cb9 Author: Laurence Oberman <loberman@redhat.com> Date: Fri Oct 8 11:35:52 2021 -0400 Deal with the bug where last was called in the perf_query code for Infiniband and was not in a loop https://sourceforge.net/p/collectl/discussion/696864/thread/fc58a31168/?limit=25#2f7c diff --git a/formatit.ph b/formatit.ph index db1776eb5385..045eef85dd6d...

  • Edgar Dean Edgar Dean posted a comment on discussion Open Discussion

    Hi Laurence, Thank you, will get this tested and revert with an update. Regards.

  • Laurence Oberman Laurence Oberman posted a comment on discussion Open Discussion

    Apologies, Looks like I put the label in an unreachable place. Please try this, attached patched file New Patch --- formatit.ph.orig 2021-10-05 09:42:58.307749832 -0400 +++ formatit.ph 2021-10-06 10:57:13.969606012 -0400 @@ -375,12 +375,12 @@ sub initRecord $message="Required module missing" if $temp=~/required by/; $message="No such file or directory" if $temp=~/No such file/; if ($message ne '') - { + {EXIT_IF:{ disableSubsys('x', "perfquery error: $message!"); $mellanoxFlag=0; $PQuery=''; - last;...

  • Laurence Oberman Laurence Oberman posted a comment on discussion Open Discussion

    Hello Thanks for testing Let me review and figure out what is wrong, I will emulate the issue so I can test it myself this time. Regards laurence On Wed, Oct 6, 2021 at 10:37 AM Edgar Dean vonschae@users.sourceforge.net wrote: Hi Lawrence, We applied the fix file and restarted the service however we still getting errors: systemctl status collectl collectl.service - collectl metric collection Loaded: loaded (/usr/lib/systemd/system/collectl.service; enabled; vendor preset: disabled) Active: failed...

  • Edgar Dean Edgar Dean posted a comment on discussion Open Discussion

    Hi Laurence, We applied the fix file and restarted the service; however, we are still getting errors: systemctl status collectl collectl.service - collectl metric collection Loaded: loaded (/usr/lib/systemd/system/collectl.service; enabled; vendor preset: disabled) Active: failed (Result: exit-code) since Wed 2021-10-06 09:56:37 IST; 11s ago Process: 109723 ExecStart=/usr/bin/collectl -D (code=exited, status=25) Oct 06 09:56:36 serverx systemd[1]: Starting collectl metric collection... Oct 06 09:56:37...

  • Edgar Dean Edgar Dean posted a comment on discussion Open Discussion

    Hi Laurence, Thank you. The patched version will be tested and an update provided shortly. Regards

  • Laurence Oberman Laurence Oberman posted a comment on discussion Open Discussion

    Hello, If I properly understood Mark's suggestion, then a fix like this patch should work. Can you please try it and get back to me? I don't have a way to reproduce it to test myself right now. --- formatit.ph 2021-10-04 19:42:48.898289477 -0400 +++ formatit.ph.fix 2021-10-04 19:42:32.366220471 -0400 @@ -342,6 +342,7 @@ sub initRecord $PQopt = 'sys' if -e "$firstHCA/counters_ext" && !($debug & 16384); # We usually only care about perfquery for non-extended counters + EXIT_IF: if ($PQopt eq '-r') { # no...

  • Laurence Oberman Laurence Oberman posted a comment on discussion Open Discussion

    Hello, I'm looking at Mark's suggested fix tonight. I see that it's uncommon to hit this, as it requires the system to have IB etc. We are still in the process of deciding how to handle the fixes for collectl upstream, but I have now added myself to the email list so I see the bug reports etc. Regards Laurence Oberman

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    Looks like laurence and others at redhat are going to be taking over support/dev for collectl so maybe he can comment -mark On Mon, Oct 4, 2021, 8:41 PM Edgar Dean vonschae@users.sourceforge.net wrote: Hi Andrew and Mark, Im running into a similar fault footprint on multiple servers. Where can an updated/corrected formatit.ph file be obtained from that has the correct code fix? Regards. Can't "last" outside a loop block at /usr/share/collectl/formatit.ph line 382 https://sourceforge.net/p/collectl/discussion/696864/thread/fc58a31168/?limit=25#8476...

  • Edgar Dean Edgar Dean posted a comment on discussion Open Discussion

    Hi Andrew and Mark, I'm running into a similar fault footprint on multiple servers. Where can an updated/corrected formatit.ph file with the correct code fix be obtained? Regards.

  • Mark Seger Mark Seger posted a comment on discussion Help

    Sorry for the late reply. Also, since retiring a few years ago I no longer have access to a system on which to do any testing. That said, looking at the code in formatit.ph, I see the line if ($fields[0]=~/^(.*)(fan.*?|temp.*?|power meter.*?)\s*(\d*)(.*)$/i) which would match strings with temp and fan in them (as well as others). A guess would be to change those to whatever your sensors are called. Not sure what's happening with the Power Meter. Also, if you run with --envdebug it will print out debugging...
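
    To see which sensor names the quoted pattern accepts, it can be tried outside collectl. The sketch below uses the same regular expression in Python (the pattern behaves the same way there); the sensor names and the extra "ambient" alternative are hypothetical examples of the kind of change suggested above, not names collectl is known to use.

```python
import re

# Pattern copied from the formatit.ph line quoted above, case-insensitive.
SENSOR_RE = re.compile(
    r"^(.*)(fan.*?|temp.*?|power meter.*?)\s*(\d*)(.*)$", re.IGNORECASE
)

# Hypothetical variant that also accepts sensors containing "ambient".
EXTENDED_RE = re.compile(
    r"^(.*)(fan.*?|temp.*?|power meter.*?|ambient.*?)\s*(\d*)(.*)$", re.IGNORECASE
)


def is_recognized(sensor_name: str, pattern=SENSOR_RE) -> bool:
    """Return True if the sensor name would match the given pattern."""
    return pattern.match(sensor_name) is not None


# Made-up sensor names, for illustration only.
for name in ("Fan 1", "CPU1 Temp", "Power Meter", "Inlet Ambient"):
    print(name, is_recognized(name), is_recognized(name, EXTENDED_RE))
```

    Note that the pattern looks for fan, temp or power meter anywhere in the name, so a name only has to contain one of those words, not start with it.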

  • Ihsan Sarfraz Ihsan Sarfraz posted a comment on discussion Help

    Hi, I am trying to collect environmental data (fan, temperature, power). I am facing two issues: I have noticed that the collectl service expects the names of the sensors from which the temperature and fan readings are captured to have a prefix of Temp-* and Fan-*. However, my server's temperature-related sensors do not start with the Temp prefix. What would be a prudent approach to modifying the source code to accommodate this? For one of my servers, I am not able to get the power reading even though the Power...

  • Mark Seger Mark Seger posted a comment on discussion Help

    Technically if you can write some code to do what you want you can integrate it yourself as a collectl plugin. But you're really on your own. -mark On Wed, Nov 18, 2020, 2:42 PM Magnus Musngi magnusmusngi@users.sourceforge.net wrote: No worries. Thanks. I was thinking, would it be possible to integrate lm-sensors to collectl, such that the temperature and fan speed could be captured through lm-sensors instead of ipmitool? Thanks Environmental monitoring without IPMI https://sourceforge.net/p/collectl/discussion/696865/thread/de314326a4/?limit=25#0c1c...

  • Magnus Musngi Magnus Musngi posted a comment on discussion Help

    No worries. Thanks. I was thinking, would it be possible to integrate lm-sensors to collectl, such that the temperature and fan speed could be captured through lm-sensors instead of ipmitool? Thanks

  • Mark Seger Mark Seger posted a comment on discussion Help

    Unfortunately, no. Sorry about that -mark On Tue, Nov 17, 2020, 4:10 PM Magnus Musngi magnusmusngi@users.sourceforge.net wrote: Hi, In addition to other data, I was planning to capture temperature and fan speed using collectl. However, the computer I'm using does not seem to support IPMI. Is there an alternative way to capture temperature and fan speed using collectl without the use of IPMI. Thanks, Magnus Environmental monitoring without IPMI https://sourceforge.net/p/collectl/discussion/696865/thread/de314326a4/?limit=25#7628...

  • Magnus Musngi Magnus Musngi posted a comment on discussion Help

    Hi, In addition to other data, I was planning to capture temperature and fan speed using collectl. However, the computer I'm using does not seem to support IPMI. Is there an alternative way to capture temperature and fan speed using collectl without the use of IPMI. Thanks, Magnus

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    The answer is in several fields. The two main ones are the system time and the user time, which taken together are 1.46 seconds. 1.46 seconds over a 1 second interval is 146%. So how did you use that much time? Are you breaking the laws of physics? Check out the THRD field. Your process is using 79 threads, so between them all you ARE using more than 1 second of cpu. You are also right about only having 4 cpus. If you added up all your pcts for all processes, their total should be < 400, I hope :)...
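
    The arithmetic described above can be written out as a tiny sketch. The function name is hypothetical, and rounding to a whole number is an assumption on my part; collectl's own formatting may differ.

```python
def cpu_pct(sys_t: float, usr_t: float, interval_s: float = 1.0) -> int:
    """CPU seconds used in the interval, expressed as a percentage of one CPU."""
    return round((sys_t + usr_t) / interval_s * 100)


# SysT and UsrT values from the first two rows of the -sZ output in the thread.
print(cpu_pct(0.05, 1.41))  # 146, matching the Pct column for PID 32205
print(cpu_pct(0.01, 0.88))  # 89, matching the Pct column for PID 32091
```

    Because the times are summed over all of a process's threads, the result can exceed 100 on a multi-core machine, with the number of CPUs times 100 as the ceiling across all processes.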

  • Ron Ron modified a comment on discussion Open Discussion

    I have been trying to find more information / knowledge with regards to the use of "Pct" in the collectl -sZ command: $ collectl -sZ -i.1:1 -R2m & # PID User PR PPID THRD S VSZ RSS CP SysT UsrT Pct AccuTime RKB WKB MajF MinF Command 32205 AAAA 20 1 79 S 4G 305M 2 0.05 1.41 146 03:00.20 0 0 0 3743 java 32091 BBBB 20 1 52 S 3G 267M 1 0.01 0.88 89 01:03.52 0 0 0 1231 /usr/bin/java 542 CCCC 20 1 122 S 5G 828M 3 0.05 0.30 35 04:24.20 0 0 0 111 /usr/bin/java 32242 DDDD 20 1 59 S 4G 421M 3 0.00 0.28 28...

  • Ron Ron posted a comment on discussion Open Discussion

    I have been trying to find more information / knowledge with regards to the use of "Pct" in the collectl -sZ command: $ collectl -sZ -i.1:1 -R2m & # PID User PR PPID THRD S VSZ RSS CP SysT UsrT Pct AccuTime RKB WKB MajF MinF Command 32205 AAAA 20 1 79 S 4G 305M 2 0.05 1.41 146 03:00.20 0 0 0 3743 java 32091 BBBB 20 1 52 S 3G 267M 1 0.01 0.88 89 01:03.52 0 0 0 1231 /usr/bin/java 542 CCCC 20 1 122 S 5G 828M 3 0.05 0.30 35 04:24.20 0 0 0 111 /usr/bin/java 32242 DDDD 20 1 59 S 4G 421M 3 0.00 0.28 28...

  • Adam S Adam S posted a comment on discussion Help

    I found the answer by looking at graphite.ph -- I needed to add the "f" flag to have the graphite output include the FQDN

  • Adam S Adam S posted a comment on discussion Help

    How does collectl get the hostname? On RHEL 6 the system always gave the FQDN; however, on RHEL 7+ it only gives the short name unless you run hostname -f. Due to how our monitoring is written, I need to figure out how to get collectl to use the FQDN. Thanks!

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    Roland - i didnt notice your name on the email. Sorry about that. I think you were one of the earliest collectl users -mark On Wed, Aug 14, 2019, 11:10 AM Roland Laifer rlaifer@users.sourceforge.net wrote: Hi Johann, I've tested the collectl patches and it works indeed with Lustre 2.10. It is great news that I can use collectl again and find out on CLI what users are doing in Lustre! @Mark: Thank you for all the support you gave! Enjoy your retirement! Hopefully someone will chain in so that we are...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    Glad to hear you've been enjoying collectl. Do you use colmux? A lot of people don't and they don't know what they're missing. With colmux I've monitored 100 lustre servers in parallel, think top, and can sort on any column. Makes it really easy to find a slow server or maybe one running at 100% CPU. Try it out if you haven't yet. You'll never look at a cluster without it again -mark On Wed, Aug 14, 2019, 11:10 AM Roland Laifer rlaifer@users.sourceforge.net wrote: Hi Johann, I've tested the collectl patches...

  • Roland Laifer Roland Laifer posted a comment on discussion Open Discussion

    Hi Johann, I've tested the collectl patches and it works indeed with Lustre 2.10. It is great news that I can use collectl again and find out on the CLI what users are doing in Lustre! @Mark: Thank you for all the support you gave! Enjoy your retirement! Hopefully someone will chime in so that we can continue to use this great tool. Regards, Roland

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    Currently only i have access as collectl has always been a 1 person show -mark On Wed, Aug 14, 2019, 8:16 AM johann peyrard yoyz2k@users.sourceforge.net wrote: Hi Mark, I have not seen this email you sent. Unfortunately, if there is no one behind collectl right now, I don't know how to push this patch. But I can easily test the collectl feature easily. Maybe there is something doable on this area. Who has the commit access today on collectl ? I may be able to spend 30 minutes on it from time to time...

  • johann peyrard johann peyrard posted a comment on discussion Open Discussion

    Hi Mark, I have not seen the email you sent. Unfortunately, if there is no one behind collectl right now, I don't know how to push this patch. But I can easily test the collectl feature. Maybe there is something doable in this area. Who has commit access on collectl today? I may be able to spend 30 minutes on this from time to time.

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    good to hear, but if you've seen the last email I sent out I retired a few months ago and am no longer in a position to support collectl. Always happy to answer questions but I just can't make code changes and test them, especially for lustre since I haven't had access to a system running lustre for many years. Also I'm quite surprised to hear you can make my old code work (glad I didn't remove it) as I thought that mechanism had been dropped. Maybe if/when someone raises their hand to support collectl...

  • johann peyrard johann peyrard modified a comment on discussion Open Discussion

    [root@node1 data]# dd if=/dev/zero of=bla bs=1M ^C 7449+0 records in 7449+0 records out 7810842624 bytes (7.8 GB) copied, 5.79551 s, 1.3 GB/s [root@node1 collectl-4.3.1-peyrardj]# ./collectl -slx waiting for 1 second sample... #<-----------InfiniBand-----------><--------Lustre Client--------> # KBIn PktIn KBOut PktOut Errs KBRead Reads KBWrite Writes 13 72 14 82 0 0 0 0 0 6251 157237 629427 157241 0 0 0 330058 322 26489 669404 2680906 669728 0 0 0 1333248 1302 26169 660456 2647079 660176 0 0 0 1296384...

  • johann peyrard johann peyrard posted a comment on discussion Open Discussion

    [root@node1 data]# dd if=/dev/zero of=bla bs=1M ^C 7449+0 records in 7449+0 records out 7810842624 bytes (7.8 GB) copied, 5.79551 s, 1.3 GB/s [root@node2 collectl-4.3.1-peyrardj]# ./collectl -slx waiting for 1 second sample... #<-----------InfiniBand-----------><--------Lustre Client--------> # KBIn PktIn KBOut PktOut Errs KBRead Reads KBWrite Writes 13 72 14 82 0 0 0 0 0 6251 157237 629427 157241 0 0 0 330058 322 26489 669404 2680906 669728 0 0 0 1333248 1302 26169 660456 2647079 660176 0 0 0 1296384...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    good questions. there are so many conventions on different systems I'm losing track as far as auto-start goes and am not sure any more about ubuntu. If it does start automatically I'm sure there's a way to disable. as far as ARM goes, I also can't remember if anyone has tried collectl on arm, I know I haven't, But after close to 20 years I'd think someone has so hopefully they can chime in? It's totally possible collectl might be doing something in its initialization causing issues, I had suggested...

  • jack xu jack xu posted a comment on discussion Open Discussion

    Hi Mark, thanks a lot for checking. In fact, after running collectl, the Ubuntu system can't finish its startup, and there is no output on screen. The above error message was collected through its console interface (serial interface), so I can't log in to the system and also can't edit its config file (/etc/collectl.conf). Two other questions: (1) Will collectl be started automatically when the system reboots after installing it through the "./install" command? (2) Does collectl support Ubuntu + arm64, or does it only support ubuntu+x86...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    Do you actually see a collectl startup message? I don't see anything about collectl in the listing. And if you disable collectl from starting does the problem go away? Given you're saying this seems to be hardware specific I've no idea how to approach this w/o my own hardware. I suppose all I could suggest is to let it try starting collectl by collecting minimal information, in other words edit the DaemonCommands line in /etc/collectl.conf and instead of -s+YZ which tells it to collectl...

  • jack xu jack xu modified a comment on discussion Open Discussion

    Hi, I hit this issue on an nvidia px2 (arm64 ubuntu system): nvidia@p21:~$ uname -a Linux p21 4.9.38-rt25-tegra #1 SMP PREEMPT RT Mon Jan 8 01:00:48 PST 2018 aarch64 aarch64 aarch64 GNU/Linux There are some warnings below from when I installed collectl on it, but I don't think they affect ubuntu startup: root@p22:/home/nvidia/xuqigang/collectl-4.3.1# ./INSTALL insserv: warning: script 'S02nv-run-date' missing LSB tags and overrides insserv: warning: script 'nv-run-once-install-dwx' missing LSB tags and...

  • jack xu jack xu posted a comment on discussion Open Discussion

    Hi, I hit this issue on an nvidia px2 (arm64 ubuntu system): nvidia@p21:~$ uname -a Linux p21 4.9.38-rt25-tegra #1 SMP PREEMPT RT Mon Jan 8 01:00:48 PST 2018 aarch64 aarch64 aarch64 GNU/Linux The nvidia px2 console prints the information below continuously: [ OK ] Started LSB: initscript. [ OK ] Started LSB: automatic crash report generation. [ OK ] Started CUPS Scheduler. [ OK ] Started CUPS Scheduler. [ OK ] Started Login Service. [ OK ] Reached target Network. Starting Network Name Resolution... Starting...

  • Mark Seger Mark Seger posted a comment on discussion Help

    I didn't do the latest IB work. You'll have to contact Peter Piela, as shown on the SF collectl page. -mark On Mon, Mar 4, 2019, 1:08 PM Haoyu Huang haoyuhuang@users.sourceforge.net wrote: Hi, Thanks for this great tool! I'm curious is the KBIn / KBOut reported by 'collectl -scx' 1024 * 8 bits or 1000 * 8 bits? Thanks, Infiniband Monitoring https://sourceforge.net/p/collectl/discussion/696865/thread/fee1379610/?limit=25#1a72 Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/collectl/discussion/696865/...

  • Haoyu Huang Haoyu Huang posted a comment on discussion Help

    Hi, Thanks for this great tool! I'm curious is the KBIn / KBOut reported by 'collectl -scx' 1024 * 8 bits or 1000 * 8 bits? Thanks,

  • Bill Bill posted a comment on discussion Open Discussion

    Thanks a million! There is enough info for me to work on it. colmux is a great piece of software!

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    Sorry for the delay, I was tied up. Always good to chat with someone who discovered colmux, I'm surprised I don't hear more about it. First and foremost I want to say that I wrote colmux probably 15 years ago and it's always sort of just worked. There might also be better ways to handle the communications between collectl and colmux with more modern perl modules, but it is what it is. That said, getting all this working can be tricky but perhaps you already know that. The number 1 challenge for colmux...

  • Bill Bill posted a comment on discussion Open Discussion

    I am trying to set up collectl + colmux as a "supercharged top". Out of the 15 servers, there are two on which I can't get colmux to work. It fails even if I only colmux that one server: Ignoring records from 'SERVER_NAME' which is not recognizable. Is the alias wrong or missing? How can I debug this error? All are running the same version of collectl and CentOS 7. "host1" is good. "host2" is the one not working. # collectl -v collectl V4.3.0-1 (zlib:2.061,HiRes:1.9725) # cat /etc/redhat-release CentOS...

  • Andew Garside Andew Garside posted a comment on discussion Open Discussion

    Hi, Mark. It's great to hear from you. I hope all is well. I like your proposal. I also agree with your idea of doing just what you need to do to fix the problem without running the risk of introducing new bugs. While the other "last" statement shouldn't die (since it's contained within an "if" within a "for" loop) I agree it looks like it should jump to a break at line 328, along with the "last" statement I originally commented on. If you want to make the change you're comfortable with and send...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    hey andy, long time no chat... It's not too often when someone bumps into a collectl bug and it's hard to tell how much/little it's actually being used, but I'm amazed I wrote this piece of code ;) I want to be really careful what I do so as not to break anything while making the logic very clean. Perhaps I'm missing something in looking at your patch with no additional context or explanation but I think what I see is you're adding a block and a label. Then when the last is executed it breaks...

  • Andew Garside Andew Garside posted a comment on discussion Open Discussion

    Hi, I have collectl 4.3.1 installed on RHEL 7.6. When I start it and check the status I get the following: # systemctl start collectl Job for collectl.service failed because the control process exited with error code. See "systemctl status collectl.service" and "journalctl -xe" for details. # systemctl status collectl.service ● collectl.service - collectl metric collection Loaded: loaded (/usr/lib/systemd/system/collectl.service; disabled; vendor preset: disabled) Active: failed (Result: exit-code)...

  • KM5 KM5 posted a comment on discussion Open Discussion

    Hi Mark, this turned out to be a typo in .bashrc which was throwing up an error and colmux was reading that first. It is fixed now.

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    good to see at least someone else likes colmux - don't hear from many people about it. they don't know what they're missing ;) This vaguely sounds like I may have seen it before but really hard to tell. What's going on is when collectl first starts running it asks each machine what version of collectl it's running followed by the date/time it is and tries to make sure they're all very close in time. It looks like the parsing of the date string is blowing up and I've no idea why. The easiest way to...
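
The startup check described above (ask each host for its date/time, then confirm the clocks are close) can be sketched as follows. This is a minimal Python illustration, not colmux's actual Perl code: the reply format `YYYYMMDD HH:MM:SS` and the names `find_timestamp` and `max_skew_seconds` are assumptions made for the example. It scans for the first line that looks like a timestamp instead of trusting the first line of the reply, which matters when a login script prints something ahead of it, as turned out to be the cause in this thread.

```python
import re
from datetime import datetime

# Hypothetical reply format: a line holding "YYYYMMDD HH:MM:SS".
# This is an assumption for illustration, not colmux's real protocol.
TIMESTAMP = re.compile(r"^(\d{8}) (\d{2}:\d{2}:\d{2})$")


def find_timestamp(reply_lines):
    """Return the first line that parses as a timestamp, skipping shell noise."""
    for line in reply_lines:
        match = TIMESTAMP.match(line.strip())
        if match:
            return datetime.strptime(" ".join(match.groups()), "%Y%m%d %H:%M:%S")
    return None  # nothing usable in the reply


def max_skew_seconds(host_times):
    """Largest difference, in seconds, between any two host clocks."""
    values = list(host_times.values())
    return (max(values) - min(values)).total_seconds()
```

A caller would treat a `None` result as "this host's reply could not be parsed" and report the raw lines, rather than doing arithmetic on fragments of an error message.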

  • KM5 KM5 posted a comment on discussion Open Discussion

    [memsql@txslmemsqlpd1 ~]$ colmux -hostwidth 20 -address all_nodes -command '-scmnd' Argument "e/" isn't numeric in subtraction (-) at /usr/bin/colmux line 450. Argument "/hom" isn't numeric in subtraction (-) at /usr/bin/colmux line 450. Month '-1' out of range 0..11 at /usr/bin/colmux line 450. Perl exited with active threads: 0 running and unjoined 10 finished and unjoined 0 running and detached [memsql@txslmemsqlpd1 ~]$ Not seen this before. These are dual homed systems. collectl works. [memsql@txslmemsqlpd1...

  • collectl collectl released /collectl/collectl-4.3.1/collectl-4.3.1.src.tar.gz

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    I did push out a new version yesterday -mark On Wed, Oct 31, 2018 at 4:21 AM Jan Schreiber jccs@users.sourceforge.net wrote: Hello Mark, what you are displaying is not a wait time but a wait time by IO, calculated in line 3432 in formatit.ph. The ever increasing values in /proc/diskstats need to have their delta calculated for every interval, providing a cumulative IO time and the number of IOs. The first gets divided by the second, resulting in msec/IO, which is what we want. Dividing the IO time...

  • collectl collectl released /collectl/collectl-4.3.1/collectl-4.3.1.src.tar.gz

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    Ok, thanks for the clarification. ;) I'll try to upload later today, Thanks for your patience -mark On Wed, Oct 31, 2018 at 4:21 AM Jan Schreiber jccs@users.sourceforge.net wrote: Hello Mark, what you are displaying is not a wait time but a wait time by IO, calculated in line 3432 in formatit.ph. The ever increasing values in /proc/diskstats need to have their delta calculated for every interval, providing a cumulative IO time and the number of IOs. The first gets divided by the second, resulting...

  • Jan Schreiber Jan Schreiber posted a comment on discussion Open Discussion

    Hello Mark, what you are displaying is not a wait time but a wait time by IO, calculated in line 3432 in formatit.ph. The ever increasing values in /proc/diskstats need to have their delta calculated for every interval, providing a cumulative IO time and the number of IOs. The first gets divided by the second, resulting in msec/IO, which is what we want. Dividing the IO time by the time interval results in a disk busy aka utilization metric in percent. That is done in line 3435. So far my understanding...
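
The two calculations described above can be written out as a short sketch. This is Python for illustration, not the Perl in formatit.ph, and the dictionary keys `ios`, `wait_ms` and `busy_ms` are hypothetical names standing in for the cumulative counters read from /proc/diskstats (I/Os completed, milliseconds spent on those I/Os, and milliseconds the device was busy).

```python
def wait_and_utilization(prev, curr, interval_secs):
    """Derive msec/IO and percent busy from two cumulative counter samples."""
    # The counters only ever grow, so work on the delta between samples.
    ios = curr["ios"] - prev["ios"]
    wait_ms = curr["wait_ms"] - prev["wait_ms"]
    busy_ms = curr["busy_ms"] - prev["busy_ms"]

    # Wait per I/O: time divided by the number of I/Os, NOT by the interval.
    wait_per_io = wait_ms / ios if ios else 0.0

    # Utilization: busy time as a share of the wall-clock interval.
    util_pct = 100.0 * busy_ms / (interval_secs * 1000.0)
    return wait_per_io, util_pct
```

With 500 I/Os taking 1000 ms in total over a 1 second interval, during which the device was busy for 500 ms, this gives 2.0 msec/IO and 50.0 percent utilization.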

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    sorry, yes I will. But now that you bring it up I was also wondering if it SHOULD divide by the interval in both cases, or rather in neither. those wait times are in msec. so let's say you had a 10 sec interval during which the wait time was 100msec. then looked at a 1sec interval and it was 10msec or a 60 second interval and it was 600. how meaningful would those numbers be? would the wait/sec be more consistent since it would be independent of the interval? at this point I'm happy to do it either...
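
The arithmetic in the question above can be checked directly. The sketch below (Python, with the hypothetical helper name `per_second`) uses the three interval/wait pairs from the post and shows that the raw totals grow with the interval while the per-second figure stays the same.

```python
# (interval in seconds, total wait in msec) pairs taken from the post above
samples = [(10, 100), (1, 10), (60, 600)]


def per_second(total_ms, interval_secs):
    """Normalize a total accumulated over an interval to a per-second value."""
    return total_ms / interval_secs


rates = [per_second(ms, secs) for secs, ms in samples]
```

All three entries in `rates` come out as 10.0, which is the interval-independent number the question is asking about.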

  • Jan Schreiber Jan Schreiber posted a comment on discussion Open Discussion

    Hello Mark, Hello Laurence, have you looked at the change? Does it get your blessing? Will you roll it out? Regards, Jan

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    hmm, since I don't divide by intSecs for the combined totals I honestly can't remember why I thought I needed to divide for the reads vs writes. but it sure seems to make sense what you're suggesting. just to get another set of eyeballs on this since he uses it a lot more than me now I'd like to ask Laurence to weigh in if he has a little time. -mark On Tue, Sep 25, 2018 at 11:04 AM Jan Schreiber jccs@users.sourceforge.net wrote: Hello Mark, I think I found the "problem"... In /usr/share/collectl/formatit.ph...

  • Jan Schreiber Jan Schreiber posted a comment on discussion Open Discussion

    Hello Mark, I think I found the "problem"... In /usr/share/collectl/formatit.ph 5329 $dskRecord=sprintf("%s$SEP%$FS$SEP%$FS$SEP%$FS$SEP%$FS$SEP%$FS$SEP%$FS$SEP%$FS$SEP%$FS$SEP%$FS$SEP%$FS$SEP%$FS$SEP%$FS$SEP%$FS", $dskName, $dskRead[$i]/$intSecs, $dskReadMrg[$i]/$intSecs, $dskReadKB[$i]/$intSecs, $dskWaitR[$i]/$intSecs, $dskWrite[$i]/$intSecs, $dskWriteMrg[$i]/$intSecs, $dskWriteKB[$i]/$intSecs, $dskWaitW[$i]/$intSecs, $dskRqst[$i], $dskQueLen[$i], $dskWait[$i], $dskSvcTime[$i], $dskUtil[$i]); Has...

  • Mark Seger Mark Seger posted a comment on discussion Help

    wow, that is weird. and yes, there is a way to do logging but I haven't looked at this code in ages and know virtually nothing about graphite. if you look at the header of graphite.ph you'll see this: debug 1 - print Var, Units and Values 2 - only print sent 'changed' Var/Units/Vales 4 - not used 8 - do not open/use socket (typically used with other flags) 16 - print socket open/close info which allows you to pass a debug mask to the graphite module, so using your example above: I just ran this on my...
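
Since the debug value is a bit mask, several of the flags listed above can be combined by adding them. The sketch below is a Python illustration of how such a mask decomposes; the flag values and descriptions are the ones quoted from the graphite.ph header, while the names `DEBUG_FLAGS` and `decode_debug_mask` are made up for the example.

```python
# Flag values as listed in the graphite.ph header quoted above.
DEBUG_FLAGS = {
    1: "print Var, Units and Values",
    2: "only print sent 'changed' values",
    4: "not used",
    8: "do not open/use socket",
    16: "print socket open/close info",
}


def decode_debug_mask(mask):
    """Return the descriptions of every flag bit set in the mask."""
    return [desc for bit, desc in DEBUG_FLAGS.items() if mask & bit]
```

For example, a mask of 9 (8 + 1) selects "print Var, Units and Values" together with "do not open/use socket", which would print what is about to be sent without touching the network.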

  • Adam S Adam S posted a comment on discussion Help

    I'm running into a weird issue with collectl and having it send data to graphite. I'm currently running collectl with the following flags: -D --tworaw --rawtoo -sbcCdDEijJmMnNstTyYZ --export graphite,10.0.0.14 On the graphite side, 90% of the time everything is working as expected, however, 10% of the time i'm seeing graphite creating weird folders ex: host/f/q/d/cpuinfo where i would expect it to create host/f/q/d/n/cpuinfo is there any way to log what collectl is sending to graphite so that i can...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    For what it's worth, dskWaitR and dskWaitW are only written to twice, like this. The first time to initialize and the second to actually do the computation. Here are the results of a grep for dskWaitR in formatit.ph. $dskWaitR[$i]=$dskWaitW[$i]=0; $dskWaitR[$dskIndex]= $dskRead[$dskIndex] ? ($dskReadTicks[$dskIndex]/$dskRead[$dskIndex]) : 0; then it's only written twice, first in the -sD terminal output and the second time in the -P format $dskRead[$i]/$intSecs, $dskReadMrg[$i]/$intSecs, $dskReadKB[$i]/$intSecs,...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    Not really sure what to say here as I haven't changed that code in many years and it's always seemed to work fine in the past. Collectl is a pretty dumb script as all it does is read from /proc, in this case /proc/diskstats, and report the changes between samples, divided by the interval. As for comparing with iostat usually the numbers agree but sometimes they can differ based on various rounding errors. Though in this case I don't think they should make a difference. I can't tell from your data...

  • Jan Schreiber Jan Schreiber posted a comment on discussion Open Discussion

    Hello Mark, noticed an issue with the following version of collectl: collectl V4.3.0-1 (zlib:2.008,HiRes:1.9711) I started fio with 16 threads reading 4 KByte from /dev/sdb directly. There is a difference in the output of running -sD interactively and as a replay for the WaitR metric. The same is true for the write metrics. Running collectl interactively results in: DISK STATISTICS (/sec) <---------reads---------------><---------writes--------------><--------averages--------> Pct Name KBytes...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    glad to hear you're enjoying collectl... The big challenge with collectl is in order to show the aggregate of all disk traffic and at the same time not double count like some other tools do, collectl needs to know which disks are real. for example you'd never want to include the dm stats with the disks it maps or you get a double count. If you look at the very end of collectl.conf, you'll see a perl pattern that is used to identify those disks that ARE to be included (and also embedded in collectl...
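
To illustrate how an include pattern decides which devices count toward the summary totals, here is a small Python sketch. Both patterns are hypothetical and written only for this example; they are not the pattern shipped in collectl.conf. The sketch shows one way a filter could keep dm devices out of the totals and still miss devices with a two-digit controller number, which is the kind of symptom reported for nvme10 through nvme15. Whether that is the actual cause is an assumption.

```python
import re

# Hypothetical patterns for illustration only; NOT the one in collectl.conf.
NARROW = re.compile(r"^(sd[a-z]+|nvme\dn\d+)$")   # a single digit after "nvme"
WIDER = re.compile(r"^(sd[a-z]+|nvme\d+n\d+)$")   # one or more digits


def summary_disks(names, pattern):
    """Keep only the device names the pattern treats as real disks."""
    return [name for name in names if pattern.match(name)]


names = ["sda", "dm-0", "nvme0n1", "nvme9n1", "nvme10n1", "nvme15n1"]
narrow_result = summary_disks(names, NARROW)
wider_result = summary_disks(names, WIDER)
```

Here `narrow_result` drops `dm-0` as intended but also loses `nvme10n1` and `nvme15n1`, while `wider_result` keeps every NVMe device and still excludes `dm-0`.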

  • Curt Bruns Curt Bruns posted a comment on discussion Open Discussion

    Hi There, I have 16 NVMe drives (nvme[0-15]n1) in a system. Running collectl w/o any arguments or just -sd doesn't report any disk stats for NVMe10-15. However if I use -sD, it reports the stats correctly. In other words, if I'm doing I/O to just /dev/nvme10n1, collectl -sd won't show any I/O, but -sD will show the IO. Love the tool - thanks for the great work! Curt Here is a link to a folder with small playback file with I/O going to nvme[10-15]n1 that demonstrates the issue. [https://drive.google.com/drive/folders/1mnANmrNMtMCG3nBNOlbuPE9tjtGhDfrH?usp=sharing]...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    Sorry if this is a little confusing, but by default, network stats are in K so when you see something like 123K it's really M and when you see M it's really G.
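
Put another way, the suffix on the screen is applied on top of a column that is already in KB, so every suffix reads one step too small. A minimal Python sketch of that mapping follows; the names `TRUE_UNIT` and `true_unit` are made up for this example.

```python
# Display suffix on a KB-based column -> the unit the number really represents.
TRUE_UNIT = {"": "KB", "K": "MB", "M": "GB", "G": "TB"}


def true_unit(display_suffix):
    """Translate the suffix shown in a KBIn/KBOut column to the actual unit."""
    return TRUE_UNIT[display_suffix]
```

So a KBIn value shown as 123K means roughly 123 MB, and one shown with an M suffix is in GB.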

  • KM5 KM5 posted a comment on discussion Open Discussion

    Using latest version- Seeing that network stats are reported as xxxxK when it should be xxxxM. The Stats are already reported in KB (KBIn, KBOut). # Wed Mar 14 13:30:47 2018 Connected: 5 of 5 # <----CPU[HYPER]-----><-----------Memory-----------><----------Disks-----------><----------Network----------> #Host cpu sys inter ctxsw Free Buff Cach Inac Slab Map KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut ip-172-31-0-36 1 0 1658 2268 26G 2G 128G 120G 8G 17G 0 0 0 4 11 24 6 22 ip-172-31-13-39 1 0...

  • KM5 KM5 posted a comment on discussion Open Discussion

    Thanks, this fixed it!

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    aha! you're running V4.1.0 and according to the release notes it was added to 4.1.1 ;) try a newer version and hopefully that will fix it -mark On Thu, Mar 8, 2018 at 10:42 AM, KM5 km97402@users.sourceforge.net wrote: Running collectl - not seeing disk stats for nvme drives. [memsql@ip-172-31-7-171 ~]$ collectl waiting for 1 second sample... <----CPU[HYPER]-----><----------Disks-----------><----------Network----------> cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut 26 0 33102...

  • KM5 KM5 posted a comment on discussion Open Discussion

    Running collectl - not seeing disk stats for nvme drives. [memsql@ip-172-31-7-171 ~]$ collectl waiting for 1 second sample... #<----CPU[HYPER]-----><----------Disks-----------><----------Network----------> #cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut 26 0 33102 17143 0 0 0 0 40105 5004 170 2563 92 0 64719 1767 0 0 0 0 111761 13864 406 6210 17 0 18557 9819 0 0 0 0 13223 1673 57 803 Ouch! [memsql@ip-172-31-7-171 ~]$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT nvme0n1 259:4...

  • John R. Naylor John R. Naylor posted a comment on discussion Open Discussion

    Thanks for your advice, Mark. With your kind help, I managed to get this working. I simplified the DaemonCommands setting in /etc/collectl.conf to read as follows: DaemonCommands = -smnc -A server There was something about the default settings that generated error messages when invoked from the command line with the -D or --daemon option: $ collectl --daemon --export lexpr -mscn -A server print() on closed filehandle MSG at /usr/bin/collectl line 5058. print() on closed filehandle MSG at /usr/bin/collectl...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    Not sure without a closer look but not on Xmas. I'm not sure what your syntax will do because -D says to read switches from DaemonCommands line in collectl.conf. try putting -A server in there and /etc/init.d/collectl restart. Then try connecting with client.pl. actually you might need --export lexpr too. Start with that and see what happens -mark On Dec 25, 2017 9:05 AM, "John R. Naylor" mejohnnaylor@users.sf.net wrote: Expected behaviour - Something like the following in the client's terminal...

  • John R. Naylor John R. Naylor posted a comment on discussion Open Discussion

    Expected behaviour - Something like the following in the client's terminal (leading #'s removed to avoid triggering MD): $ collectl -sc waiting for 1 second sample ... <--------CPU--------> cpu sys inter ctxsw 1 0 743 1413 1 0 661 1211 1 0 581 1054 1 0 679 1206 1 1 626 1157 0 0 707 1380 System & Collectl versions: Host OS: Ubuntu 17.04 $ collectl -v collectl V4.0.5-1 (zlib:2.069001,HiRes:1.9733) Steps to reproduce start the daemon... $ sudo collectl -D -A server -sc start the client... $ ./client.pl...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    I have not looked at the lustre module so can't say what it does, but at the core of collectl processing is to call the record function which writes data to disk for later processing. I believe it's here where d4 is processed so maybe it's not being called? If you know perl, put a copy of the lustre code in your local directory and stick in some print statements, that's the best I can offer. Maybe send an email to Peter? -mark On Dec 24, 2017 1:22 AM, "Jeff Johnson" aeonjeff@users.sf.net wrote: Using...

  • Jeff Johnson Jeff Johnson posted a comment on discussion Open Discussion

    Using -d4 allows debugging down to the raw data fetched from /proc but -d4 doesn't work when importing a module. The module import supersedes or overrides something so -d4 won't show anything. collectl -d4 -c4 --import lustreMDS,s How would I get the -d4 style debugging output when importing a module? Note: collectl works fine if I run it without importing a module. It's the module import that appears to be broken somehow. Thanks (and merry Christmas)

  • collectl collectl released /collectl/collectl-4.3.0/collectl-4.3.0.src.tar.gz

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    my go to method for debugging almost anything is to run collectl with -d4. it will print out the raw data as it's collected from /proc and everywhere else. try it with something simple like collectl -sc -d4 -c1 just to get the feel of it. when you do collect 1 sample as indicated by c1, you'll actually see 2 samples read in since collectl has to 'prime the pump' with base values so to speak -mark On Wed, Dec 13, 2017 at 11:50 PM, Jeff Johnson aeonjeff@users.sf.net wrote: Greetings, I am trying to...
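
The "prime the pump" remark above follows from how rates are derived: each reported value is the difference between two consecutive reads of a cumulative counter, so one reported sample needs two reads. A minimal Python sketch, with the hypothetical helper name `rates`:

```python
def rates(counter_reads, interval_secs):
    """Turn consecutive cumulative counter reads into per-second rates."""
    pairs = zip(counter_reads, counter_reads[1:])
    return [(curr - prev) / interval_secs for prev, curr in pairs]
```

Two reads such as `[1000, 1500]` taken one second apart yield a single rate of 500.0, and a single read yields nothing to report, which is why -c1 shows two raw samples under -d4.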

  • Jeff Johnson Jeff Johnson posted a comment on discussion Open Discussion

    Greetings, I am trying to figure out why a combination of collectl and the collectl-lustre (by Peter Piela) are now printing blank lines instead of meaningful (or erroneous) data. Collectl with that imported module worked historically. I've tried reviewing the output from perl Debug::DumpTrace (perl -d:DumpTrace collectl --import lustreOSS,s) and I am unable to discern where things go off the rails. If I run collectl --import lustreOSS,s I get this: ***waiting for 1 second sample... Ouch!* If you...

  • Mark Seger Mark Seger posted a comment on discussion Open Discussion

    sorry for the delay actually the answer is a mix of yes and sort of. Way back when I first wrote collectl I added the -si switch to report inode consumption and by including --verbose you'll see this: mjs@blkjak:~/qa/smoke$ collectl -si --verbose waiting for 1 second sample... INODE SUMMARY Dentries File Handles Inodes Number Unused Alloc MaxPct Number 3067472 655771 2312K 94.37 2904K 3067469 655771 2312K 94.37 2904K 3067469 655771 2312K 94.37 2904K however I've spent almost no time playing with...
