linux - Split a variable table in python

After calling lsof, I'm looking for a generic way to split every row so that each cell of the table ends up in its own string. The problem is that the width of every column can change each time the command is run.
COMMAND     PID       USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
init          1       root  cwd       DIR                8,1      4096          2 /
kthreadd      2       root  txt   unknown                                         /proc/2/exe
kjournald    42       root  txt   unknown                                         /proc/42/exe
udevd        77       root  cwd       DIR                8,1      4096          2 /
udevd        77       root  txt       REG                8,1    133176     139359 /sbin/udevd
flush-8:1 26221       root  cwd       DIR                8,1      4096          2 /
flush-8:1 26221       root  rtd       DIR                8,1      4096          2 /
flush-8:1 26221       root  txt   unknown                                         /proc/26221/exe
sudo      26228       root    5u     unix 0xfff999002579d3c0       0t0     515611 socket
python    30077       root    2u      CHR                1,3       0t0        700 /dev/null
                It's possible for the command to have a space in the name, so it's not safe to just .split it. Perhaps you can use the headings to discover the field widths.
– John La Rooy
                Dec 3, 2013 at 11:44
Instead of parsing the output of the lsof command, install the psutil module. It also has the advantage of being cross-platform.
import psutil

def get_all_files():
    files = set()
    for proc in psutil.process_iter():
        try:
            # open_files() is the current name of the older get_open_files()
            files.update(proc.open_files())
        except Exception:  # probably don't have permission to get the files
            pass
    return files

print(get_all_files())
# Example output from the original run (the exact repr varies by psutil and Python version):
# set([openfile(path='/opt/google/chrome/locales/en-GB.pak', fd=28), openfile(path='/home/jon/.config/google-chrome/Default/Session Storage/000789.log', fd=95), openfile(path='/proc/2414/mounts', fd=8) ... ]
You can then adapt this to include the owning process and other information, e.g.:
import psutil

for proc in psutil.process_iter():
    try:
        fids = proc.open_files()
        # name() and username() are methods in current psutil releases
        name, user = proc.name(), proc.username()
    except Exception:
        continue
    for fid in fids:
        print(name, proc.pid, user, fid.path)
#gnome-settings-daemon 2147 jon /proc/2147/mounts
#pulseaudio 2155 jon /home/jon/.config/pulse/2f6a9045c2bc8db6bf32b2d7517969bf-device-volumes.tdb
#pulseaudio 2155 jon /home/jon/.config/pulse/2f6a9045c2bc8db6bf32b2d7517969bf-stream-volumes.tdb
                As far as I can see, psutil returns only the regular files opened by each process; I want all files opened in the system.
– John Lapoya
                Dec 3, 2013 at 13:32
                @JohnSnow okay... but running lsof on my machine returns 26,005 lines, of which, a load are all permission denied and other messages... at least the above filters it down to regular files (you can also retrieve network resources if wanted) from processes the program has rights to...
– Jon Clements
                Dec 3, 2013 at 13:36
You know that column labels are right aligned except for the first and last. Hence you can extract the column borders from the ending of the column labels (equivalent to: from the beginning of whitespace between adjacent column labels).
import re

# assuming input_file to be a file-like object
header = next(input_file)
# each run of whitespace starts where a column label ends
borders = [match.start() for match in re.finditer(r'\s+', header)]
second_to_third_border = borders[1]
borders = borders[1:-1]  # drop the first and last because they are not right-aligned
for line in input_file:
    # COMMAND may contain spaces, so search backwards from the end of PID
    first_to_second_border = line[:second_to_third_border].rfind(' ')
    actual_borders = [0, first_to_second_border] + borders + [len(line)]
    dset = []
    for (s, e) in zip(actual_borders[:-1], actual_borders[1:]):
        dset.append(line[s:e].strip())
    print(dset)
Concerning the first column:

You can find the border between the first and second columns on each line by searching backwards for whitespace from the border between columns two and three.
Search backwards because, as mentioned in the comments above, the command might contain spaces, whereas the PID certainly does not.
Concerning the last column:

The column stretches from the border between the second-to-last and last columns to the end of the given line.
Example:
from io import StringIO  # Python 3; Python 2 used "from StringIO import StringIO"
input_file = StringIO('''\
COMMAND     PID       USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
init          1       root  cwd       DIR                8,1      4096          2 /
kthreadd      2       root  txt   unknown                                         /proc/2/exe
kjournald    42       root  txt   unknown                                         /proc/42/exe
''')
prints:
['init', '1', 'root', 'cwd', 'DIR', '8,1', '4096', '2', '/']
['kthreadd', '2', 'root', 'txt', 'unknown', '', '', '', '/proc/2/exe']
['kjournald', '42', 'root', 'txt', 'unknown', '', '', '', '/proc/42/exe']
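The same idea can be sketched as a reusable function and checked against a row whose COMMAND contains a space. This is only a sketch, assuming Python 3. The ROW_FORMAT string, the helper names column_ends and split_row, and the "my prog" row are illustrative assumptions, not part of lsof or of the code above. The sample table is generated from a format string so that its alignment is exact.

```python
import re

# Hypothetical layout mimicking lsof: first and last columns left-aligned,
# all others right-aligned.
ROW_FORMAT = "{:<9} {:>5} {:>10} {:>4} {:>9} {:>18} {:>9} {:>10} {}"


def column_ends(header):
    """Return the offset just past the end of every header label."""
    return [m.end() for m in re.finditer(r"\S+", header)]


def split_row(line, ends):
    """Split one data row into stripped cells using the header label ends."""
    line = line.rstrip("\n")
    pid_end = ends[1]
    # COMMAND may contain spaces but PID cannot, so search backwards
    # from the end of the PID column for the COMMAND/PID border.
    first_border = line.rfind(" ", 0, pid_end)
    # The last column (NAME) runs to the end of the line.
    borders = [0, first_border] + ends[1:-1] + [len(line)]
    return [line[s:e].strip() for s, e in zip(borders, borders[1:])]


header = ROW_FORMAT.format("COMMAND", "PID", "USER", "FD", "TYPE",
                           "DEVICE", "SIZE/OFF", "NODE", "NAME")
rows = [
    ROW_FORMAT.format("init", "1", "root", "cwd", "DIR", "8,1", "4096", "2", "/"),
    ROW_FORMAT.format("kthreadd", "2", "root", "txt", "unknown", "", "", "",
                      "/proc/2/exe"),
    # Made-up row with spaces in both COMMAND and NAME.
    ROW_FORMAT.format("my prog", "4242", "root", "txt", "REG", "8,1",
                      "133176", "139359", "/opt/my prog/run"),
]

ends = column_ends(header)
parsed = [split_row(row, ends) for row in rows]
for cells in parsed:
    print(cells)
```

In the third row, "my prog" and "/opt/my prog/run" each come back as a single cell, and the blank DEVICE, SIZE/OFF and NODE cells of the kthreadd row come back as empty strings.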
Addressing the 'spaces in NAME problem'
To address the issue of possible spaces in the NAME column mentioned in the comments, I can propose the following solution. It is based on my desire to keep things simple and on the assumption that only the last column can contain spaces.
The algorithm is simple:
1. Find the position where the last column starts (I use the starting position of the NAME header)
2. Cut the line at that position. What you just cut off is the value of the NAME column
3. split() the rest of the line.
Here is the code:
import fileinput

header_limits = dict()
records = list()
header_line = None
for line in fileinput.input():
    if not header_line:
        # first line: remember where each column label starts
        header_line = line
        col_names = header_line.split()
        for col_name in col_names:
            header_limits[col_name] = header_line.find(col_name)
        continue
    record = dict()
    # NAME runs from its header position to the end of the line and may contain spaces
    record['NAME'] = line[header_limits['NAME']:].strip()
    line = line[:header_limits['NAME'] - 1]
    # the remaining cells are whitespace-separated
    record.update(zip(col_names, line.split()))
    records.append(record)
for record in records:
    print("%s\n" % repr(record))
The result is a list of dictionaries. Every dictionary corresponds to one line of the lsof output.
This is an interesting task that shows the power of Python for everyday tasks.
Anyway, if possible I would prefer to use a Python library such as the proposed psutil.
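To show how the list of dictionaries might be used, here is a small sketch that wraps the same steps in a function and filters the records. It assumes Python 3 and an in-memory sample instead of fileinput. The names parse_lsof and ROW_FORMAT and the "chrome" row are made up for illustration. The sketch shares the limitations of the approach: a COMMAND containing a space, or a blank cell in the middle of a row, would shift the values.

```python
# Hypothetical layout used only to build a correctly aligned sample table.
ROW_FORMAT = "{:<9} {:>5} {:>10} {:>4} {:>9} {:>18} {:>9} {:>10} {}"


def parse_lsof(lines):
    """Parse lsof-style lines into a list of dicts keyed by header label."""
    lines = iter(lines)
    header = next(lines)
    col_names = header.split()
    name_start = header.find("NAME")  # NAME is the last, left-aligned column
    records = []
    for line in lines:
        # Everything from the NAME position onwards is the NAME value.
        record = {"NAME": line[name_start:].strip()}
        # The remaining cells are split on whitespace; blank trailing
        # cells simply leave their keys out of the record.
        record.update(zip(col_names, line[:name_start].split()))
        records.append(record)
    return records


sample = [
    ROW_FORMAT.format("COMMAND", "PID", "USER", "FD", "TYPE",
                      "DEVICE", "SIZE/OFF", "NODE", "NAME"),
    ROW_FORMAT.format("init", "1", "root", "cwd", "DIR", "8,1", "4096", "2", "/"),
    ROW_FORMAT.format("kthreadd", "2", "root", "txt", "unknown", "", "", "",
                      "/proc/2/exe"),
    # Made-up row whose NAME contains a space.
    ROW_FORMAT.format("chrome", "2414", "jon", "95r", "REG", "8,1", "1024", "789",
                      "/home/jon/.config/google-chrome/Default/Session Storage/000789.log"),
]

records = parse_lsof(sample)

# Use .get() because rows with blank cells lack some keys.
dir_commands = [r["COMMAND"] for r in records if r.get("TYPE") == "DIR"]
print(dir_commands)
```

Because the kthreadd row has no DEVICE, SIZE/OFF or NODE values, those keys are missing from its dictionary, which is why the filter uses get() rather than indexing.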